[Thesis Defense Talk – Shreyas Ramoji, 28/7 @4:30pm, MMCR, EE] – “Supervised Learning Approaches for Language and Speaker Recognition”
July 28 @ 4:30 PM - 5:30 PM IST
Thesis Defense Talk
Venue: MMCR, EE
Time: 4:30pm [High Tea at 4:15pm]
Speaker: Shreyas Ramoji
Title: Supervised Learning Approaches for Language and Speaker Recognition
In the age of artificial intelligence, one of the important goals of the speech processing research community is to enable machines to automatically recognize who is speaking and in what language.
In the first part of this talk, I will discuss our efforts towards a supervised version of the generative-model-based embedding extractor for speaker and language recognition. We refer to the embeddings from this supervised approach as s-vectors. In this approach, a database of speech recordings (in the form of sequences of short-term feature vectors) is modeled with a Gaussian mixture model called the universal background model (GMM-UBM). The deviation in the mean components is captured in a lower-dimensional latent space, called the i-vector space, using a factor analysis framework. In our research, we propose a fully supervised version of the i-vector model, where each label class is associated with a Gaussian prior with a class-specific mean parameter. The joint prior on the latent variable, marginalized over the sample space of classes, then becomes a GMM. With detailed data analysis and visualization, we show that the s-vector features yield representations that succinctly capture the language (accent) label information and significantly improve the recognition of various accents of the same language.
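The core of the supervised i-vector (s-vector) idea above can be sketched in a few lines. The following is a minimal NumPy illustration, assuming a single-Gaussian UBM with diagonal covariance for simplicity (rather than a full GMM-UBM); all variable names and dimensions are illustrative, not taken from the thesis. With a class-specific prior y | class c ~ N(mu_c, I) on the latent variable, the embedding is the posterior (MAP) estimate of y given the observation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: D-dim feature space, R-dim latent (s-vector) space, C classes.
D, R, C = 20, 5, 3

# Single-Gaussian "UBM" stand-in: global mean m, diagonal noise covariance Sigma.
m = rng.normal(size=D)
Sigma_diag = 0.5 + rng.random(D)           # diagonal of Sigma
T = rng.normal(size=(D, R)) / np.sqrt(R)   # total-variability (loading) matrix

# Class-specific prior means in the latent space (the supervised part):
# y | class c ~ N(mu[c], I), so the prior on y, marginalized over classes, is a GMM.
mu = rng.normal(size=(C, R))

def s_vector(x, c):
    """MAP estimate of the latent variable y given observation x and class c."""
    Tt_Sinv = T.T / Sigma_diag             # T^T Sigma^{-1} (Sigma diagonal)
    L = np.eye(R) + Tt_Sinv @ T            # posterior precision
    return np.linalg.solve(L, Tt_Sinv @ (x - m) + mu[c])

# Draw one observation from class 0 and extract its s-vector.
y_true = mu[0] + rng.normal(size=R)
x = m + T @ y_true + rng.normal(size=D) * np.sqrt(Sigma_diag)
y_hat = s_vector(x, 0)
```

Note how the class mean mu[c] enters the posterior mean as an extra additive term; with all mu[c] = 0 this reduces to the standard (unsupervised) i-vector point estimate.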
In the second part of the talk, I will discuss our efforts on the problem of fully supervised end-to-end speaker verification, where a binary decision has to be made as to whether a pair of recordings belongs to the same speaker. We propose a neural network approach for back-end modeling, where the likelihood ratio score of the generative probabilistic linear discriminant analysis (PLDA) model is posed as a discriminative similarity function, and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is a differentiable approximation of the minimum detection cost function (DCF) used as an evaluation metric in various speaker verification challenges. Speaker recognition experiments using the NPLDA model are performed on the speaker verification tasks of the VOiCES datasets as well as the SITW challenge dataset. Further, we explore a fully neural approach in which the model outputs the verification score directly from the acoustic feature inputs. This Siamese neural network model, termed E2E-NPLDA, combines embedding extraction and back-end modeling into a single processing pipeline. In several speaker recognition experiments on benchmark datasets, the proposed NPLDA and E2E-NPLDA models are shown to improve significantly over the then state-of-the-art systems.
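The two ingredients described above — a PLDA-style score as a learnable similarity function, and a smooth surrogate for the detection cost — can be sketched as follows. This is a minimal NumPy illustration: the two-matrix quadratic form and the parameter names (P, Q, theta, beta, alpha) are assumptions for exposition, not the exact parameterization or cost weights from the thesis.

```python
import numpy as np

def nplda_score(x1, x2, P, Q, c):
    """PLDA-style quadratic similarity between a pair of embeddings.
    In NPLDA-style training, P, Q, and c would be learnable parameters,
    initialized from a generative PLDA model."""
    return x1 @ P @ x2 + x1 @ Q @ x1 + x2 @ Q @ x2 + c

def soft_dcf(scores, labels, theta=0.0, beta=9.0, alpha=5.0):
    """Sigmoid relaxation of the detection cost P_miss + beta * P_fa.
    alpha controls the sharpness of the sigmoid that replaces the hard
    threshold at theta, making the cost differentiable in the scores."""
    sig = 1.0 / (1.0 + np.exp(-alpha * (scores - theta)))
    p_miss = np.mean(1.0 - sig[labels == 1])   # target trials scored below theta
    p_fa = np.mean(sig[labels == 0])           # non-target trials scored above theta
    return p_miss + beta * p_fa

# Example: a well-separated batch of trial scores yields a small soft DCF.
scores = np.array([4.0, 3.5, -4.0, -3.0])
labels = np.array([1, 1, 0, 0])
loss = soft_dcf(scores, labels)
```

In a full system, gradients of this loss would flow back through the score function (and, in the E2E variant, through the embedding extractor as well) to minimize the verification cost directly.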
I will conclude the talk by highlighting some of the noteworthy approaches that were published during the course of this research work, and identifying some important research directions related to this thesis that can be pursued in the future.
Shreyas Ramoji is a Research Associate at the Learning and Extraction of Acoustic Patterns (LEAP) Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bengaluru. He obtained his Bachelor of Engineering degree from the Department of Electronics and Communication Engineering, PES Institute of Technology, Bangalore South Campus. His research interests include speaker and language recognition, diarization, representation learning for multilingual and conversational speech, ML/AI applied to healthcare and the environment, natural language processing, explainability and interpretability of neural networks, and neuro-symbolic AI.