Ph.D. Thesis Oral Defense of Mr. Anoop C. S.: Automatic speech recognition for low-resource Indian languages
August 11, 2023 @ 3:00 PM - 5:00 PM IST
Name of the student: ANOOP C. S.
Advisors: Prof. A. G. Ramakrishnan & Dr. G. N. Rathna
External examiner: Prof. Umesh S, Dept of EE, IIT Madras
Date and Time: 11 August 2023 (Friday) 3:00 PM
Venue (hybrid): MMCR, C 241, First Floor, Dept. of EE
AND
Microsoft Teams meeting link:
teams.microsoft.com
TITLE: Automatic speech recognition for low-resource Indian languages
Building good models for automatic speech recognition (ASR) requires large amounts of annotated speech data. Most Indian languages are low-resourced and lack enough training data to build robust and efficient ASR systems. However, many have an overlapping phoneme set and a strong correspondence between their character sets and pronunciations. This thesis exploits such similarities among the Indian languages to improve speech recognition in low-resource settings.
Significant contributions of the thesis:
Exploiting the pronunciation similarities across multiple Indian languages through shared label sets:
A common set of tokens is proposed across multiple Indian languages, and its performance is analyzed in monolingual and multilingual settings.
- Sanskrit Library Phonetic Encoding (SLP1) tokens, which exploit the pronunciation-based structuring of character code points in the Unicode blocks of Indian languages, perform better than other grapheme-to-phoneme (G2P) based tokens in monolingual ASR settings.
- Syllable-based sub-words perform better than the character-based token units in monolingual speech recognition. However, character-based SLP1 tokens perform better in cross-lingual transfer.
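One way to picture a shared label set: the Unicode blocks of the major Indian scripts follow a largely parallel layout, so a character's offset within its block can act as a script-independent label. The sketch below is only an illustration under that assumption. It is not the SLP1 mapping used in the thesis, and `BLOCK_STARTS`, `shared_label`, and `to_shared_labels` are hypothetical names.

```python
# Start code points of a few Indic Unicode blocks; each block spans 128 code points.
BLOCK_STARTS = {
    "devanagari": 0x0900,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def shared_label(ch):
    """Return the offset of `ch` inside its Indic block, or None if it is outside all of them."""
    cp = ord(ch)
    for start in BLOCK_STARTS.values():
        if start <= cp < start + BLOCK_SIZE:
            return cp - start
    return None


def to_shared_labels(text):
    """Map a string to script-independent labels, skipping characters outside the listed blocks."""
    labels = []
    for ch in text:
        label = shared_label(ch)
        if label is not None:
            labels.append(label)
    return labels


if __name__ == "__main__":
    # The same word written in Devanagari and in Kannada yields the same label sequence.
    print(to_shared_labels("\u0915\u092e\u0932"))
    print(to_shared_labels("\u0c95\u0cae\u0cb2"))
```

A pronunciation-based encoding such as SLP1 would then attach one phonetic symbol to each shared position, so that acoustic models trained on different scripts predict the same targets.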
Strategies for improving the performance of ASR systems in low-resource scenarios (target languages) by exploiting annotated data from high-resource languages (source languages):
Three different low-resource settings have been studied:
A) No labeled audio data is available in the target language; only a limited amount of unlabeled data is available. Unsupervised domain adaptation (UDA) schemes popular in image classification problems have been adopted to tackle this case.
- Adversarial training with gradient reversal layers (GRL) and domain separation networks (DSN) provide word error rate (WER) improvements of 6.71% and 7.32%, respectively, on Sanskrit compared to a baseline hybrid DNN-HMM system trained on Hindi.
- The UDA models outperform multi-task training with language recognition as the auxiliary task.
- Selection of the source language is critical in UDA systems.
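Since the results in this abstract are reported as word error rate (WER) and character error rate (CER), a minimal sketch of how these metrics are commonly computed may help. It uses the standard Levenshtein edit distance. The function names are hypothetical, and counting spaces as characters in CER is an assumption, since scoring conventions vary between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces are counted as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the bat sat")` gives one substitution over three reference words.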
B) The target language has only a small amount of labeled speech data and some text data for building language models. In this case, available data in high-resource languages is used through shared label sets to build unified acoustic models (AM) and language models (LM).
- Unified language-agnostic AM + LM performs better than monolingual AM + LM in cases where (a) only limited speech data is available for training the acoustic models and (b) the test speech data is from domains different from those used in training.
- In general, multilingual AM + monolingual LM performs the best.
C) There are N target languages with limited training data and several source languages with large training sets. In this case, the usefulness of model-agnostic meta-learning (MAML) pretraining is established for Indian languages, and improvements are proposed with text-similarity-based loss weightings.
- MAML beats joint multilingual pretraining by an average of 5.4% in CER and 20.3% in WER.
- With just 25% of the data, MAML performance matches joint multilingual models trained on the whole target data.
- Similarity with the source languages impacts the target language’s ASR performance.
- Text similarity, measured through cosine and Mahalanobis distances, is used to weight the losses during MAML pretraining. This yields a mean absolute improvement of 1% in WER.
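The idea of similarity-based loss weighting can be sketched roughly as follows. This is not the thesis's implementation: the text features (character bigram counts), the normalization of similarities into weights, and all function names are assumptions made for illustration, and only the cosine variant is shown.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Count character bigrams in a text sample (assumed feature representation)."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def loss_weights(target_text, source_texts):
    """Turn source-to-target text similarities into weights that sum to 1."""
    target_vec = char_bigrams(target_text)
    sims = {lang: cosine_similarity(target_vec, char_bigrams(text))
            for lang, text in source_texts.items()}
    total = sum(sims.values())
    if total == 0:
        # Fall back to uniform weights when no source resembles the target.
        return {lang: 1.0 / len(sims) for lang in sims}
    return {lang: s / total for lang, s in sims.items()}


def weighted_loss(losses, weights):
    """Combine per-source-language losses using the similarity weights."""
    return sum(weights[lang] * losses[lang] for lang in losses)
```

In a setup like this, source languages whose text looks more like the target language's text contribute more to the pretraining objective, which is consistent with the observation above that similarity to the source languages affects target performance.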
ALL ARE WELCOME ONLINE!
Meeting Recording