BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//EE - ECPv5.10.0//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:EE
X-ORIGINAL-URL:https://ee.iisc.ac.in
X-WR-CALDESC:Events for EE
BEGIN:VTIMEZONE
TZID:Asia/Kolkata
BEGIN:STANDARD
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
TZNAME:IST
DTSTART:20210101T000000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=Asia/Kolkata:20211124T163000
DTEND;TZID=Asia/Kolkata:20211124T173000
DTSTAMP:20260420T145506
CREATED:20211122T233509Z
LAST-MODIFIED:20211122T233826Z
UID:239276-1637771400-1637775000@ee.iisc.ac.in
SUMMARY:PhD Thesis Defense of Mr. Jitendra Kumar Dhiman
DESCRIPTION:Date and Time: November 24\, 2021\, 11 AM. \nTitle of the thesis: Spectrotemporal Processing of Speech Signals Using the Riesz Transform \nExaminer: Prof. S. R. Mahadeva Prasanna\, IIT Dharwad and IIT Guwahati \nAbstract: Speech signals have time-varying spectra. Spectrograms have served as a useful tool for the visualization and analysis of speech signals in the joint time-frequency plane. In this thesis\, we consider 2-D analysis of speech spectrograms. We consider a spectrotemporal patch and model it as a 2-D amplitude-modulated and frequency-modulated (AM-FM) sinusoid. Demodulation of the spectrogram yields the 2-D AM and FM components\, which correspond to the slowly varying vocal-tract envelope and the excitation\, respectively. For solving the demodulation problem\, we rely on the complex Riesz transform\, which is a 2-D extension of the 1-D Hilbert transform. The demodulation viewpoint brings forth many interesting properties of the speech signal. The spectrotemporal carrier helps us identify the regions that are coherent and those that are not. Based on this idea\, we introduce the coherencegram corresponding to a given spectrogram. The temporal evolution of the pitch harmonics can also be characterized by the orientation at each time-frequency coordinate\, resulting in the orientationgram. We show that these features collectively enable solutions for the important problems of voiced/unvoiced segmentation\, aperiodicity estimation\, periodic/aperiodic signal separation\, and pitch tracking. We compare the performance of the proposed methods with benchmark methods. The spectrotemporal amplitude characterizes the time-varying magnitude response of the vocal-tract filter. We show how the formants and their bandwidths manifest in the spectrotemporal amplitude. The formant bandwidths turn out to be mildly overestimated\, and the overestimation is perceptible when one performs speech synthesis using the estimated parameters. We propose a method for correcting the formant bandwidths\, which also restores the speech quality. Finally\, we use the curated spectrotemporal amplitude\, pitch\, aperiodicity\, and voiced/unvoiced decisions for the task of speech reconstruction in a spectral synthesis model and a neural vocoder\, namely\, WaveNet. We show that conditioning WaveNet on the spectrotemporal features results in high-quality speech synthesis. The quality of the synthesized speech is assessed using both objective and subjective measures. \nWe rely on the Perceptual Evaluation of Speech Quality (PESQ) measure and the standard Mean Opinion Score (MOS) test for objective and subjective evaluation\, respectively. The performance of the proposed parameters is evaluated in a vocoder framework that uses the spectral synthesis model for speech reconstruction. The objective evaluation shows that the performance of the Riesz transform-based speech parameters is on par with the baseline systems. Using the spectral synthesis model\, we report an average PESQ score in the range from 2.30 to 3.45 over a total of 200 speech waveforms taken from the CMU-ARCTIC database comprising both male and female speakers. In comparison\, WaveNet-based speech reconstruction gave an average PESQ score of 3.65. \nSubjective evaluation was carried out through listening tests conducted in an acoustic test chamber on volunteers in the age group of 21 to 30. The average MOS was 4.30 when the Riesz transform-based features were used in WaveNet for speech reconstruction\, which was also comparable to the baseline systems: STRAIGHT and WORLD. Both objective and subjective evaluations also showed that the quality of reconstructed speech waveforms was better with the proposed features in a WaveNet vocoder than in the spectral synthesis model. \nAn audio demonstration is available at the GitHub link: http://jitendradhiman.github.io/Demo \nBiography of Jitendra Kumar Dhiman: Jitendra Kumar Dhiman received his B.Tech. degree in Electronics and Telecommunication Engineering from the Institution of Electronics and Telecommunication Engineering\, Delhi\, India\, in 2010\, and his M.Tech. degree in Signal Processing from the Indian Institute of Technology Hyderabad in 2013. Subsequently\, he joined Spectrum Lab (EE Department\, IISc) as a project assistant\, working on prosody modification of speech signals\, and then continued as a PhD student working on spectrotemporal models for speech processing. His research interests include speech and audio signal processing and machine learning. He will soon be joining Samsung Research and Innovation\, Bangalore (SRIB) as Chief Engineer.
URL:https://ee.iisc.ac.in/event/phd-thesis-defense-of-mr-jitendra-kumar-dhiman/
END:VEVENT
END:VCALENDAR