Thesis colloquium of Mr. Anoop C. S.
April 10, 2023 @ 9:00 PM - 11:00 PM IST
Advisor : Prof. A. G. Ramakrishnan
Date and Time: 10 April 2023 (Monday) 3:30 PM
TITLE: Automatic speech recognition for low-resource Indian languages
Building good models for automatic speech recognition (ASR) requires large amounts of annotated speech data. Most Indian languages are low-resourced and lack enough training data to build robust and efficient ASR systems. However, many have an overlapping phoneme set and a strong correspondence between their character sets and pronunciations. In this thesis, we exploit such similarities among the Indian languages to improve speech recognition in low-resource settings.
Significant contributions of the thesis:
Exploiting the pronunciation similarities across multiple Indian languages through shared label sets:
We propose the use of a common set of tokens across multiple Indian languages and analyze their performance in monolingual and multilingual settings.
- We find that the Sanskrit Library Phonetic Encoding (SLP1) tokens, which exploit the pronunciation-based ordering of Unicode code points in Indian-language scripts, perform better than some other grapheme-to-phoneme (G2P) based tokens in monolingual ASR settings.
- Syllable-based sub-words perform better than the character-based token units in monolingual speech recognition. However, character-based SLP1 tokens perform better in cross-lingual transfer.
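As a rough illustration of why a shared label set is feasible: the Unicode blocks of the major Indic scripts follow a parallel, ISCII-derived layout, so the same offset inside each 128-code-point block usually denotes the corresponding sound. The sketch below maps characters to script-independent tokens by that offset. The function name `to_common_tokens` and the `<XX>` token format are hypothetical. This is not the SLP1 mapping itself (SLP1 assigns Latin letters), and it ignores script-specific exceptions.

```python
# Base code points of the Unicode blocks for some Indic scripts.
# Each block spans 128 code points.
INDIC_BLOCK_BASES = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def to_common_tokens(text):
    """Map each Indic character to a token based on its offset within its script block.

    Characters outside the listed blocks (spaces, Latin letters, digits)
    are passed through unchanged.
    """
    tokens = []
    for ch in text:
        cp = ord(ch)
        for base in INDIC_BLOCK_BASES.values():
            if base <= cp < base + BLOCK_SIZE:
                # Hypothetical token format: block offset as two hex digits.
                tokens.append(f"<{cp - base:02X}>")
                break
        else:
            tokens.append(ch)
    return tokens


# Devanagari KA (U+0915) and Kannada KA (U+0C95) share offset 0x15.
print(to_common_tokens("\u0915"))  # ['<15>']
print(to_common_tokens("\u0C95"))  # ['<15>']
```

With such a mapping, transcripts from different scripts can share one output vocabulary, which is the property a common token set relies on for multilingual training and cross-lingual transfer.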
Strategies for improving the performance of ASR systems in low-resource scenarios (target languages) exploiting the annotated data from high-resource languages (source languages):
We study three different low-resource settings:
A) Labelled audio data is not available in the target language. Only a limited amount of unlabeled data is available. We adopt the unsupervised domain adaptation (UDA) schemes popular in image classification problems to tackle this case.
- Adversarial training with gradient reversal layers (GRL) and domain separation networks (DSN) provide word error rate (WER) improvements of 6.71% and 7.32%, respectively, in Sanskrit compared to a baseline hybrid DNN-HMM system trained on Hindi.
- The UDA models outperform multi-task training with language recognition as the auxiliary task.
- Selection of the source language is critical in UDA systems.
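The gradient reversal idea behind the adversarial scheme can be sketched in a few lines. The layer is the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass, so the feature extractor is pushed towards features that the domain (here, language) classifier cannot separate. The sketch is framework-free; the class name `GradientReversal` and the scaling factor `lam` are illustrative assumptions, not the implementation used in the thesis.

```python
class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the backward pass."""

    def __init__(self, lam=1.0):
        # Hypothetical trade-off weight for the adversarial (domain) loss.
        self.lam = lam

    def forward(self, features):
        # Features pass unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_from_domain_classifier):
        # The gradient sent back to the feature extractor is reversed,
        # so minimizing the domain loss downstream maximizes it upstream.
        return [-self.lam * g for g in grad_from_domain_classifier]


grl = GradientReversal(lam=0.5)
print(grl.forward([0.2, -1.0, 3.0]))   # [0.2, -1.0, 3.0]
print(grl.backward([1.0, -2.0, 4.0]))  # [-0.5, 1.0, -2.0]
```

In an autodiff framework the same behaviour is normally written as a custom operation with an overridden backward pass, placed between the shared feature extractor and the domain classifier.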
B) The target language has only a small amount of labeled audio data and some text data for building language models. We try to benefit from the available data in high-resource languages through shared label sets to build unified acoustic models (AM) and language models (LM).
- Unified language-agnostic AM + LM performs better than monolingual AM + LM in cases where (a) only limited speech data is available for training the acoustic models and (b) the speech data is from domains different from that used in training.
- In general, multilingual AM + monolingual LM performs the best.
C) There are N target languages with limited training data and several source languages with large training sets. Here, we establish the usefulness of model-agnostic meta-learning (MAML) pretraining in Indian languages and propose improvements with text-similarity-based loss weighting.
- MAML beats joint multilingual pretraining by an average of 5.4% in CER and 20.3% in WER.
- With just 25% of the data, MAML performance matches joint multilingual models trained on the whole target data.
- Similarity with the source languages impacts the target language’s ASR performance.
- We use text similarity, measured through cosine and Mahalanobis distances, to weight the losses during MAML pretraining. This yields a mean absolute improvement of 1% in WER.
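One way to picture similarity-based loss weighting is sketched below. It covers only the cosine case, using character-frequency vectors of text written in a common label set. The feature choice, the normalization of the weights to sum to one, and all function names are assumptions made for illustration, since the announcement does not specify them; the Mahalanobis variant is not shown.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_weights(target_text, source_texts):
    """Weight each source language by its text similarity to the target language.

    `source_texts` maps a language name to sample text. Weights are
    normalized to sum to one (an assumption of this sketch).
    """
    target_vec = Counter(target_text)
    sims = {
        lang: cosine_similarity(target_vec, Counter(text))
        for lang, text in source_texts.items()
    }
    total = sum(sims.values())
    if total == 0:
        # No overlap with any source: fall back to uniform weights.
        return {lang: 1.0 / len(sims) for lang in sims}
    return {lang: s / total for lang, s in sims.items()}


def weighted_loss(losses, weights):
    """Combine per-language losses using the similarity weights."""
    return sum(weights[lang] * losses[lang] for lang in losses)


# Toy usage with placeholder strings standing in for transliterated text.
weights = similarity_weights("abab", {"src1": "ab", "src2": "cd"})
print(weighted_loss({"src1": 2.0, "src2": 10.0}, weights))
```

In this toy case `src2` shares no characters with the target, so its loss receives zero weight and the combined loss is driven entirely by `src1`.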
ALL ARE WELCOME ONLINE!