Thesis Colloquium of Mr. Shreyas Ramoji @3pm
October 7, 2022 @ 3:00 pm - 4:00 pm UTC+0
Date: October 7th, Friday, 3-4pm
Degree Registered: PhD
Venue: EE, MMCR [1st Floor, C241] and in Microsoft Teams at https://tinyurl.com/2rfbs7ke
Thesis Title: Supervised Learning Approaches for Language and Speaker Recognition
Abstract: In the age of artificial intelligence, one of the important goals of the research community is to get machines to automatically figure out who is speaking and in what language, a task that humans are naturally capable of. Developing algorithms that automatically infer the speaker, language, or accent from a given segment of speech is a challenging task for machines and has been a topic of research for at least three decades. While most of the prior successes have been through the development of unsupervised embedding extractors, the main aim of this doctoral research is to propose novel supervised approaches for robust speaker and language recognition.
In the first part of this talk, we propose a supervised version of the i-vector, a popular technique for front-end embedding extraction in speaker and language recognition. In this approach, a database of speech recordings (in the form of a sequence of short-term feature vectors) is modeled with a Gaussian Mixture Model, called the Universal Background Model (GMM-UBM). The deviation in the mean components is captured in a lower dimensional latent space called the i-vector space using a factor analysis framework. In our work, we proposed a fully supervised version of the i-vector model, where each label class is associated with a Gaussian prior with a class-specific mean parameter. The joint prior (marginalized over the sample space of classes) on the latent variable becomes a GMM. The choice of prior is motivated by the Gaussian back-end, where the conventional i-vectors for each language are modeled with a single Gaussian distribution. With detailed data analysis and visualization, we showed that the supervised i-vector (s-vector) features yield representations that succinctly capture the language (accent) label information and do a significantly better job of distinguishing the various accents of the same language. We performed language recognition experiments on the NIST Language Recognition Evaluation (LRE) 2017 challenge dataset, which has test segments ranging from 3 to 30 seconds. With the s-vector framework, we observed relative improvements of 8% to 20% in terms of the Bayesian detection cost function, 4% to 24% in terms of EER, and 9% to 18% in terms of classification accuracy over the conventional i-vector framework. We also performed language recognition experiments showing similar improvements on the RATS dataset and Mozilla Common Voice dataset, and speaker classification experiments using LibriSpeech.
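The change of prior described above can be sketched in equations. The notation is an assumption made for illustration and is not taken from the thesis: M_s is the GMM mean supervector of recording s, m is the UBM mean supervector, T is the total variability matrix, w_s is the latent vector, and class c has prior weight \pi_c and mean \mu_c. The identity covariance of each class-conditional prior is likewise assumed; the announcement only states that the mean is class-specific.

```latex
\begin{align}
  % Factor analysis model shared by both variants
  M_s &= m + T\,w_s \\
  % Conventional i-vector: one standard normal prior for all recordings
  p(w_s) &= \mathcal{N}(w_s;\, 0,\, I) \\
  % Supervised variant: Gaussian prior with a class-specific mean
  p(w_s \mid c) &= \mathcal{N}(w_s;\, \mu_c,\, I) \\
  % Marginalizing over the classes gives a GMM prior on the latent variable
  p(w_s) &= \sum_{c} \pi_c\, \mathcal{N}(w_s;\, \mu_c,\, I)
\end{align}
```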
In the second part of the talk, we explore the problem of speaker verification, where a binary decision has to be made on a test speech segment as to whether it is spoken by a target speaker or not, based on a limited duration of enrollment speech. The state-of-the-art approach to speaker verification was to extract fixed-dimensional embeddings from speech of arbitrary duration and train a back-end generative model called Probabilistic Linear Discriminant Analysis (PLDA), which was used to make decisions using a Bayesian decision framework. We proposed a neural network approach for back-end modeling, where the likelihood ratio score of the generative PLDA model is posed as a discriminative similarity function, and the learnable parameters of the score function are optimized using a verification cost. The proposed model, termed neural PLDA (NPLDA), is initialized using the generative PLDA model parameters. The loss function for the NPLDA model is an approximation of the minimum detection cost function (DCF) used as one of the evaluation metrics in various speaker verification challenges. Further, we explore a fully neural approach where the neural model outputs the verification score directly, given the acoustic feature inputs. This Siamese neural network (SiamNN) model combines embedding extraction and back-end modeling into a single processing pipeline. The development of the single neural Siamese model allows the joint optimization of all the modules using a verification cost. We provide a detailed analysis of the influence of hyper-parameters, choice of loss functions, and data sampling strategies for training these models. Several speaker recognition experiments were performed using Speakers in the Wild (SITW), VOiCES, and NIST SRE datasets, where the proposed NPLDA and SiamNN models are shown to improve significantly over the state of the art.
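As a rough illustration of the back-end idea, the sketch below scores an enrollment/test embedding pair with a quadratic, PLDA-style similarity function and evaluates a smoothed detection cost on a batch of trials. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the matrices `P` and `Q`, the sigmoid smoothing with steepness `alpha`, the weight `beta`, and the default values are all hypothetical choices made for this example.

```python
import numpy as np


def plda_style_score(x_enroll, x_test, P, Q):
    """Quadratic pair score: x_e^T Q x_e + x_t^T Q x_t + 2 x_e^T P x_t.

    P and Q are hypothetical learnable matrices; in a neural back-end they
    could be initialized from a trained generative PLDA model.
    """
    return (x_enroll @ Q @ x_enroll
            + x_test @ Q @ x_test
            + 2.0 * (x_enroll @ P @ x_test))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_detection_cost(scores, is_target, threshold, beta=9.9, alpha=15.0):
    """Smoothed detection cost: P_miss + beta * P_fa.

    The hard accept/reject decision is replaced by a sigmoid (an assumed
    smoothing) so that the cost varies smoothly with scores and threshold.
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=float)
    accept = sigmoid(alpha * (scores - threshold))  # soft "accept" decision
    p_miss = np.sum(is_target * (1.0 - accept)) / np.sum(is_target)
    p_fa = np.sum((1.0 - is_target) * accept) / np.sum(1.0 - is_target)
    return p_miss + beta * p_fa


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 4
    P, Q = np.eye(dim), -0.5 * np.eye(dim)  # illustrative parameters only
    enroll, test = rng.normal(size=dim), rng.normal(size=dim)
    print("pair score:", plda_style_score(enroll, test, P, Q))
    print("soft cost:", soft_detection_cost([5.0, 4.0, -5.0, -4.0],
                                            [1, 1, 0, 0], threshold=0.0))
```

Because the smoothed cost is differentiable, the score parameters and the threshold could in principle be trained with gradient descent in an automatic differentiation framework; NumPy is used here only to keep the example short.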
We conclude the talk by highlighting some of the noteworthy approaches that were published during the course of this research work and identifying some important future directions that can be explored.
Bio: Shreyas Ramoji is a Ph.D. scholar at the Learning and Extraction of Acoustic Patterns (LEAP) Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bengaluru. He obtained his Bachelor of Engineering degree from the Department of Electronics and Communication Engineering, PES Institute of Technology, Bangalore South Campus in 2016. He is a student member of the IEEE Signal Processing Society and ISCA. His research interests include Speaker Verification, Language and Accent Identification, Neuroscience, Machine Learning, and Artificial Intelligence.
All are welcome.