BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//EE - ECPv5.10.0//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:EE
X-ORIGINAL-URL:https://ee.iisc.ac.in
X-WR-CALDESC:Events for EE
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
BEGIN:STANDARD
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
DTSTART:20210101T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=Asia/Kolkata:20210929T163000
DTEND;TZID=Asia/Kolkata:20210929T173000
DTSTAMP:20260420T145505
CREATED:20211110T014011Z
LAST-MODIFIED:20211110T014011Z
UID:239071-1632933000-1632936600@ee.iisc.ac.in
SUMMARY:PhD Thesis Defence of Aravind Illa
DESCRIPTION:Thesis Title: Acoustic-Articulatory Mapping: Analysis and Improvements with Neural Network Learning Paradigms\nAbstract: Human speech is one of the many acoustic signals we perceive\, and it carries linguistic and paralinguistic (e.g.\, speaker identity\, emotional state) information. Speech acoustics are produced as a result of different temporally overlapping gestures of speech articulators (such as lips\, tongue tip\, tongue body\, tongue dorsum\, velum\, and larynx)\, each of which regulates constriction in different parts of the vocal tract. Estimating speech acoustic representations from articulatory movements is known as articulatory-to-acoustic forward (AAF) mapping\, i.e.\, articulatory speech synthesis\, while estimating articulatory movements from the speech acoustics is known as acoustic-to-articulatory inverse (AAI) mapping. These acoustic-articulatory mapping functions are known to be complex and nonlinear.\nThe complexity of this mapping depends on a number of factors. These include the kind of representations used in the acoustic and articulatory spaces. Typically\, these representations capture both linguistic and paralinguistic aspects of speech. How each of these aspects contributes to the complexity of the mapping is unknown. These representations and\, in turn\, the acoustic-articulatory mapping are affected by the speaking rate as well. The nature and quality of the mapping vary across speakers. Thus\, the complexity of the mapping also depends on the amount of data from a speaker as well as the number of speakers used in learning the mapping function. Further\, how language variations impact the mapping requires detailed investigation. This thesis analyzes a few of these factors in detail and develops neural-network-based models to learn mapping functions robust to many of these factors. 
 \nElectromagnetic articulography (EMA) sensor data has been used directly in the past as articulatory representations (ARs) for learning the acoustic-articulatory mapping function. In this thesis\, we address the problem of optimal EMA sensor placement such that the air-tissue boundaries\, as seen in the mid-sagittal plane of real-time magnetic resonance imaging (rtMRI)\, are reconstructed with minimum error. Following the optimal sensor placement work\, acoustic-articulatory data was collected using EMA from 41 subjects with speech stimuli in English and Indian native languages (Hindi\, Kannada\, Tamil and Telugu)\, resulting in a total of ~23 hours of data used in this thesis. Representations from the raw waveform are also learnt for the AAI task using convolutional and bidirectional long short-term memory neural networks (CNN-BLSTM)\, where the learned filters of the CNN are found to be similar to those used for computing Mel-frequency cepstral coefficients (MFCCs)\, which are typically used for the AAI task. In order to examine the extent to which a representation having only the linguistic information can recover ARs\, we replace MFCC vectors with one-hot encoded vectors representing phonemes\, which are further modified to remove the time duration of each phoneme and keep only the phoneme sequence. Experiments with the phoneme sequence using an attention network achieve an AAI performance that is identical to that using phonemes with timing information\, while there is a drop in performance compared to that using MFCCs.\nExperiments examining variation in speaking rate reveal that the errors in estimating the vertical motion of tongue articulators from acoustics at a fast speaking rate are significantly higher than those at a slow speaking rate. In order to reduce the demand for data from a speaker\, low-resource AAI is proposed using a transfer learning approach. 
 Further\, we show that AAI can be modeled to learn the acoustic-articulatory mappings of multiple speakers through a single AAI model rather than building separate speaker-specific models. This is achieved by conditioning an AAI model with speaker embeddings\, which benefits AAI in seen and unseen speaker evaluations. Finally\, we show the benefit of estimated ARs in a voice conversion application. Experiments reveal that ARs estimated from speaker-independent AAI preserve linguistic information and suppress speaker-dependent factors. These ARs (from an unseen speaker and language) are used to drive a target-speaker-specific AAF to synthesize speech\, which preserves linguistic information and the target speaker’s voice characteristics.
URL:https://ee.iisc.ac.in/event/phd-thesis-defence-of-aravind-illa/
END:VEVENT
END:VCALENDAR