PhD Thesis Defense of Mr. Jitendra Kumar Dhiman
November 24, 2021 @ 4:30 PM - 5:30 PM IST
Date and Time: November 24, 2021, 11 AM.
Title of the thesis: Spectrotemporal Processing of Speech Signals Using the Riesz Transform
Examiner: Prof. S. R. Mahadeva Prasanna, IIT Dharwad and IIT Guwahati
Abstract: Speech signals have time-varying spectra. Spectrograms have served as a useful tool for the visualization and analysis of speech signals in the joint time-frequency plane. In this thesis, we consider 2-D analysis of speech spectrograms. We consider a spectrotemporal patch and model it as a 2-D amplitude-modulated and frequency-modulated (AM-FM) sinusoid. Demodulation of the spectrogram yields the 2-D AM and FM components, which correspond to the slowly varying vocal-tract envelope and the excitation, respectively. For solving the demodulation problem, we rely on the complex Riesz transform, which is a 2-D extension of the 1-D Hilbert transform.
The demodulation viewpoint brings forth many interesting properties of the speech signal. The spectrotemporal carrier helps us identify the regions that are coherent and those that are not. Based on this idea, we introduce the coherencegram corresponding to a given spectrogram. The temporal evolution of the pitch harmonics can also be characterized by the orientation at each time-frequency coordinate, resulting in the orientationgram. We show that these features collectively enable solutions for the important problems of voiced/unvoiced segmentation, aperiodicity estimation, periodic/aperiodic signal separation, and pitch tracking. We compare the performance of the proposed methods with benchmark methods.
The spectrotemporal amplitude characterizes the time-varying magnitude response of the vocal-tract filter. We show how the formants and their bandwidths manifest in the spectrotemporal amplitude. It turns out that the formant bandwidths are mildly overestimated, and the overestimation is perceptible when one performs speech synthesis using the estimated parameters. We propose a method for correcting the formant bandwidths, which also restores the speech quality.
Finally, we use the curated spectrotemporal amplitude, pitch, aperiodicity, and voiced/unvoiced decisions for the task of speech reconstruction in a spectral synthesis model and a neural vocoder, namely, WaveNet. We show that conditioning WaveNet on the spectrotemporal features results in high-quality speech synthesis. The quality of the synthesized speech is assessed using both objective and subjective measures.
We rely on the Perceptual Evaluation of Speech Quality (PESQ) measure and the standard Mean Opinion Score (MOS) test for objective and subjective evaluation, respectively. The performance of the proposed parameters is evaluated in a vocoder framework that uses the spectral synthesis model for speech reconstruction. The objective evaluation shows that the performance of the Riesz transform-based speech parameters is on par with the baseline systems. Using the spectral synthesis model, we report average PESQ scores ranging from 2.30 to 3.45 over a total of 200 speech waveforms taken from the CMU-ARCTIC database comprising both male and female speakers. In comparison, WaveNet-based speech reconstruction gave an average PESQ score of 3.65.
Subjective evaluation was carried out through listening tests conducted in an acoustic test chamber on volunteers in the age group of 21 to 30. The average MOS was 4.30 when the Riesz transform-based features were used in WaveNet for speech reconstruction, which is comparable with the baseline systems, STRAIGHT and WORLD. Both objective and subjective evaluations also showed that the quality of the reconstructed speech waveforms was higher when the proposed features were used in a WaveNet vocoder than when they were used in the spectral synthesis model.
An audio demonstration is available at the GitHub link: http://jitendradhiman.github.io/Demo
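As a rough illustration of the demodulation step described in the abstract, the sketch below computes the two Riesz transform components of a 2-D patch in the Fourier domain and derives a local amplitude, phase, and orientation from them. It is a generic monogenic-signal formulation written with NumPy, not the thesis implementation; the function name `riesz_demodulate`, the mean-removal step, and the choice to report orientation modulo pi are assumptions made for this example.

```python
import numpy as np


def riesz_demodulate(patch):
    """Return (amplitude, phase, orientation) of a 2-D patch.

    Hypothetical helper for illustration only: a textbook monogenic-signal
    computation, not the method used in the thesis.
    """
    patch = np.asarray(patch, dtype=float)
    patch = patch - patch.mean()  # assumption: the Riesz transform ignores the DC term

    rows, cols = patch.shape
    fy = np.fft.fftfreq(rows)[:, None]  # vertical frequency (along rows)
    fx = np.fft.fftfreq(cols)[None, :]  # horizontal frequency (along columns)
    radius = np.hypot(fx, fy)
    radius[0, 0] = 1.0  # avoid 0/0 at DC; the numerator is zero there anyway

    spectrum = np.fft.fft2(patch)
    # Riesz components: frequency responses -j*fx/|f| and -j*fy/|f|.
    rx = np.real(np.fft.ifft2(-1j * fx / radius * spectrum))
    ry = np.real(np.fft.ifft2(-1j * fy / radius * spectrum))

    amplitude = np.sqrt(patch**2 + rx**2 + ry**2)      # local AM estimate
    phase = np.arctan2(np.hypot(rx, ry), patch)        # local phase in [0, pi]
    orientation = np.mod(np.arctan2(ry, rx), np.pi)    # carrier direction, modulo pi
    return amplitude, phase, orientation


if __name__ == "__main__":
    # Synthetic patch: a pure 2-D sinusoid with unit amplitude.
    n = 64
    y, x = np.mgrid[0:n, 0:n]
    patch = np.cos(2 * np.pi * (3 * x + 4 * y) / n)
    amplitude, phase, orientation = riesz_demodulate(patch)
    print(amplitude.min(), amplitude.max())
```

For a pure 2-D sinusoid, the recovered amplitude should be constant across the patch. The orientation is only meaningful where the Riesz components are not close to zero, so the estimate is unreliable near the peaks and troughs of the carrier.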
Biography of Jitendra Kumar Dhiman: Jitendra Kumar Dhiman received his B.Tech. degree in Electronics and Telecommunication Engineering from the Institution of Electronics and Telecommunication Engineering, Delhi, India, in 2010, and his M.Tech. degree in Signal Processing from the Indian Institute of Technology Hyderabad in 2013. Subsequently, he joined Spectrum Lab (EE Department, IISc) as a project assistant, where he worked on prosody modification of speech signals, and then continued there as a PhD student working on spectrotemporal models for speech processing. His research interests include speech and audio signal processing and machine learning. He will soon be joining Samsung Research and Innovation, Bangalore (SRIB) as Chief Engineer.