Announcements:
|
January 1, 2017: First lecture will be held in EE B 303 on January 4, 2017 (Wednesday) at 3:30pm.
January 9, 2017: If you are attending this course (credit or audit) please send email to the instructors (with email subject E9_261_2017) indicating your name, SR No, and whether you are crediting or auditing.
January 11, 2017: The lecture of January 16, 2017 will be held in EE B 303 at 11:00am.
January 16, 2017: HW# 1 is due on January 23, 2017.
January 18, 2017: We need to reschedule the lecture of January 23, 2017. Please respond to the Doodle poll (sent through email) with your preference for the rescheduled lecture timing.
February 1, 2017: HW# 2 is due on February 8, 2017.
March 20, 2017: HW# 3 is due on March 24, 2017.
April 15, 2017: Midterm report (Problem definition, literature review, progress so far, future plan, Max 2 pages) for project is due on April 17, 2017.
April 20, 2017: The final examination will be held on April 24, 2017 in EE B 303 from 9:30am to 12:30pm.
April 20, 2017: The final project report (Introduction, Work done, Novelty, Data/Experimental setup, Summary, Max 4 pages) is due on May 2, 2017.
April 20, 2017: The final project slides should be emailed to the instructors on or before May 3, 2017.
April 20, 2017: The final project presentation (Max 6-8 slides, 10 min presentation) will be held on May 4, 2017 from 9:30am to 12:30pm in EE B 303.
|
Textbooks:
|
- Fundamentals of Speech Recognition, Lawrence R. Rabiner and Biing-Hwang Juang, Prentice Hall, 1993.
- Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014.
- Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, Prentice Hall, 2001.
- Digital Processing of Speech Signals, Lawrence R. Rabiner and Ronald W. Schafer, Pearson Education, 2008.
|
Topics covered:
|
Date
|
Topics
|
Remarks
|
Jan 4
|
Course logistics, information in speech, speech chain, speech research - science and technology
|
Introductory lecture
|
Jan 9
|
Phonemes, allophones, diphones, morphemes, lexicon, consonant clusters, IPA, ARPABET.
|
Notes
|
Jan 11
|
Summary of phonetics and phonology, manner and place of articulation, intonation, stress, co-articulation, assimilation, elision, speech production models, formants, human auditory system, auditory modeling, cochlear signal processing, speech perception theories, Fletcher-Munson curves, perceptual units of loudness.
|
Notes
|
Jan 16
|
Pitch perception, timbre, masking, critical bands, Bark scale, HRTF, distorted speech perception.
|
Notes HW# 1
|
Jan 18
|
Time-varying signal, time-varying system, temporal and frequency resolution, short-time Fourier transform (STFT), properties of STFT, inverse STFT.
|
Notes ST# 1
|
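The analysis equation above can be sketched directly: slide a window along the signal and take the DFT of each frame. A minimal NumPy sketch; the 440 Hz tone, 8 kHz sampling rate, 256-sample Hann window, and 128-sample hop are illustrative choices, not values from the lecture.

```python
import numpy as np

def stft(x, win, hop):
    """Short-time Fourier transform: window successive frames of x and DFT each one."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

# toy input: one second of a 440 Hz tone at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

X = stft(x, np.hanning(256), hop=128)
peak_bin = np.argmax(np.abs(X[10]))   # strongest bin in one interior frame
peak_hz = peak_bin * fs / 256         # bin spacing is fs / window length
```

With a 256-point window the frequency resolution is 31.25 Hz, so the detected peak lands within one bin of the true 440 Hz, illustrating the temporal/frequency resolution trade-off discussed in the lecture.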
Jan 20
|
Filtering and filterbank interpretations of the STFT, filter-bank synthesis, and introduction to the overlap-add method.
|
Notes
|
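The overlap-add idea can be demonstrated in a few lines: window the signal into overlapping frames, then resynthesize by adding the windowed frames back and normalizing by the accumulated window envelope. A minimal sketch with an arbitrary window/hop choice; the normalization by the summed squared window is one common convention, not necessarily the one used in class.

```python
import numpy as np

def overlap_add(frames, win, hop):
    """Resynthesize a signal from windowed frames by overlap-add,
    normalizing by the accumulated squared-window envelope."""
    n = len(win)
    out_len = hop * (len(frames) - 1) + n
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for k, f in enumerate(frames):
        out[k * hop:k * hop + n] += f * win     # apply synthesis window and add
        norm[k * hop:k * hop + n] += win ** 2   # track total window weight
    return out / np.maximum(norm, 1e-12)

# analysis: window the signal into overlapping frames, then resynthesize
win, hop = np.hanning(256), 64
x = np.random.randn(2048)
frames = [x[i:i + 256] * win for i in range(0, len(x) - 256 + 1, hop)]
y = overlap_add(frames, win, hop)
```

Wherever the window envelope is nonzero the reconstruction is exact up to numerical error, which is the property that makes STFT-domain modification (denoising, time-scale modification) practical.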
Jan 30
|
Overlap-add method, reconstruction from STFT magnitude, wideband and narrowband spectrograms, spectrograms of different sounds -- vowels, fricatives, semivowels, nasals, stops -- spectrogram reading, formants, pattern playback, weighted overlap-add method, spectrogram reassignment, speech denoising, time-scale modification.
|
Notes
|
Feb 1
|
Time-frequency representations, time-bandwidth product, Gabor transform, time-frequency tiling, auditory filterbanks, auditory filter modeling, wavelet-based auditory filters, auditory models, time-varying parameters in speech, using Praat to estimate time-varying parameters.
|
Notes HW# 2 ST# 2
|
Feb 6
|
Homomorphic filtering, cepstrum, properties of the cepstrum, uniqueness of the cepstrum, motivation for separating the excitation and the vocal-tract response using the cepstrum.
|
Notes
|
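The separation of excitation and vocal-tract response can be seen numerically: the real cepstrum of a voiced-like signal has its filter contribution near quefrency zero and excitation peaks at multiples of the pitch period. A minimal sketch; the 200 Hz pitch, 8 kHz rate, and one-pole "vocal tract" are toy assumptions for illustration only.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

# toy "voiced" frame: periodic impulse train through a one-pole filter
fs, f0, N = 8000, 200, 1200
n = np.arange(N)
excitation = (n % (fs // f0) == 0).astype(float)   # pitch period = 40 samples
x = np.zeros(N)
for i in range(N):                                  # x[i] = e[i] + 0.9 x[i-1]
    x[i] = excitation[i] + (0.9 * x[i - 1] if i else 0.0)

c = real_cepstrum(x)
# look for the excitation peak away from the low-quefrency (vocal tract) region
pitch_quef = np.argmax(c[20:200]) + 20
```

The peak appears at a multiple of 40 samples (the pitch period), so low-quefrency liftering keeps the smooth envelope while high-quefrency liftering keeps the excitation.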
Feb 8
|
Derivation of the cepstrum for a pole-zero transfer function, a periodic impulse train, and white noise; liftering, homomorphic vocoder, mel-frequency cepstral coefficients (MFCCs).
|
Notes
|
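The mel filterbank at the heart of MFCC computation is straightforward to construct: triangular filters spaced uniformly on the mel scale. A sketch using the common 2595·log10(1 + f/700) mel mapping; the 20 filters, 512-point FFT, and 8 kHz rate are illustrative parameters, not course-mandated values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers spaced uniformly on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)   # edge frequencies -> FFT bins
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

fb = mel_filterbank(20, 512, 8000)
# MFCCs then follow by taking log filterbank energies and a DCT; sketch ends here
```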
Feb 10
|
AM-FM model, non-linear models, signal subspace approach, sinusoidal model and its applications, chirp model, short-time chirp transform, mixture-Gaussian envelope chirp model, group delay analysis.
|
Notes
|
Feb 15
|
- Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?
- The evolution of the Lombard effect: 100 years of psychoacoustic research
- Reverberant speech enhancement using cepstral processing
- Enhancement of reverberant speech using LP residual signal
- Reverberant speech enhancement by temporal and spectral processing
- Joint dereverberation and noise reduction using beamforming and a single-channel speech enhancement scheme
- Acoustic characteristics related to the perceptual pitch in whispered vowels
- A comprehensive vowel space for whispered speech
- Fundamental frequency generation for whisper-to-audible speech conversion
- Silent communication: whispered speech-to-clear speech conversion
- Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease
- Seeing speech: capturing vocal tract shaping using real-time magnetic resonance imaging
- Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease
- Speech production knowledge in automatic speech recognition
- Knowledge from speech production used in speech technology: articulatory synthesis
- Speech production and speech modelling
|
|
Feb 20
|
Introduction to linear prediction (LP), LP as a filtering problem, orthogonality principle, optimal linear predictor, Yule-Walker equations, properties of the autocorrelation matrix, line spectral processes. Reference: Chapter 2 of Theory of Linear Prediction
|
-
|
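The optimal predictor follows directly from the Yule-Walker (normal) equations: estimate the autocorrelation, build the Toeplitz matrix, and solve. A minimal sketch on a synthetic AR(2) process with made-up coefficients (1.5, -0.7), chosen only so the recovered predictor can be checked against ground truth.

```python
import numpy as np

def lp_coefficients(x, order):
    """Solve the Yule-Walker equations R a = r for the optimal order-p
    linear predictor, using biased autocorrelation estimates."""
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(order + 1)]) / N
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1]), r

# synthetic AR(2) process: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n]
rng = np.random.default_rng(0)
w = rng.standard_normal(50000)
x = np.zeros_like(w)
for n in range(2, len(x)):
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + w[n]

a, r = lp_coefficients(x, 2)   # should recover approximately (1.5, -0.7)
```

Because the data really are AR(2), the order-2 predictor recovers the generating coefficients, which is the whitening property discussed later in the course.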
Feb 22
|
-
|
Midterm# 1
|
Feb 27
|
Relationship between the eigenvalues of the autocorrelation matrix and the power spectrum. Augmented normal equations, estimation of LP coefficients using the Levinson-Durbin recursion. Reflection coefficients. Reference: Chapter 3 of Theory of Linear Prediction
|
Notes
|
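The Levinson-Durbin recursion solves the Toeplitz normal equations in O(p^2) operations and yields the reflection coefficients as a by-product. A minimal sketch, checked against a direct linear solve; the autocorrelation sequence below is a toy positive-definite example, not data from the course.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: order-recursive solution of the normal
    equations, returning predictor coefficients a, reflection (PARCOR)
    coefficients k, and the final prediction error."""
    a = np.zeros(order)
    k = np.zeros(order)
    err = r[0]
    for m in range(order):
        # partial correlation of the new sample with the current residual
        lam = (r[m + 1] - np.dot(a[:m], r[1:m + 1][::-1])) / err
        k[m] = lam
        new = a.copy()
        new[m] = lam
        new[:m] = a[:m] - lam * a[:m][::-1]   # order-update of the predictor
        a = new
        err *= 1.0 - lam ** 2                 # error shrinks by (1 - k_m^2)
    return a, k, err

# check against a direct solve of the same normal equations
r = np.array([1.0, 0.5, 0.2, 0.05])           # toy positive-definite autocorrelation
a_ld, k, err = levinson_durbin(r, 3)
R = np.array([[r[abs(i - j)] for j in range(3)] for i in range(3)])
a_direct = np.linalg.solve(R, r[1:4])
```

All reflection coefficients have magnitude below one, which is the positive-definiteness check mentioned in the lecture, and the recursion agrees with the direct solve.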
March 1
|
Stalling of the prediction error and spectral flatness, definition of an autoregressive (AR) process, AR processes and their relationship with linear prediction. Error whitening. AR approximation of a wide-sense stationary sequence. Spectral estimation using AR modeling. Reference: Chapter 5 of Theory of Linear Prediction
|
ST# 3
|
March 13
|
Summary of autocorrelation matching property, AR modeling of speech spectra, Moving average and ARMA model definition. Perceptual linear predictive (PLP) analysis of speech. Basics of pattern recognition and learning. Reference: Chapter 5 of Theory of Linear Prediction, Chapter 4, 5 of Deep Learning
|
-
|
March 13
|
Supervised versus unsupervised learning. Optimization in supervised learning - Gradient descent and stochastic gradient descent. One layer neural network - properties and limitations. Need for hidden layer activations. Choice of non-linearities. Reference: Chapter 6 of Deep Learning
|
-
|
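Gradient descent and its stochastic variant can be contrasted on the simplest supervised problem, least-squares regression: full-batch descent uses the exact gradient, SGD uses one random example per step. A minimal sketch; the data, true weights, learning rates, and iteration counts are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(500)

# full-batch gradient descent on the mean squared error
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # exact gradient over all examples
    w -= 0.1 * grad

# stochastic gradient descent: one randomly chosen example per update
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    g = 2 * X[i] * (X[i] @ w_sgd - y[i])    # noisy single-example gradient
    w_sgd -= 0.01 * g
```

Both recover the generating weights; SGD needs a smaller step size because its gradient estimate is noisy, which is the batch-versus-stochastic trade-off from the lecture.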
March 17
|
Cost function for neural network optimization. Neural networks estimate Bayesian posterior probabilities. Universal approximation properties of NNs. Need for a deep hierarchical learning paradigm for complex data. Back-propagation learning for deep neural networks. Reference: Chapter 6 of Deep Learning
|
Slides
Reference Paper
Reference Book - "Automatic Speech Recognition - Deep Learning Approach", Dong Yu, Li Deng
|
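Back-propagation for a one-hidden-layer network fits in a short script, and XOR is the classic problem a single layer cannot solve. A minimal sketch with cross-entropy loss (so the output delta is simply y - t); the hidden width, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0], [1], [1], [0]], float)       # XOR targets

H = 8                                           # hidden width (a free choice)
W1 = rng.standard_normal((2, H)); b1 = np.zeros(H)
W2 = rng.standard_normal((H, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(2000):
    # forward pass: tanh hidden layer, sigmoid output
    h = np.tanh(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    losses.append(-np.mean(t * np.log(y + 1e-12) + (1 - t) * np.log(1 - y + 1e-12)))
    # backward pass: cross-entropy with sigmoid gives output delta (y - t)
    d2 = (y - t) / len(X)
    dW2, db2 = h.T @ d2, d2.sum(0)
    d1 = (d2 @ W2.T) * (1 - h ** 2)             # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d1, d1.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g                            # gradient-descent update
```

The loss falls steadily, demonstrating both the chain-rule gradient flow and why a hidden layer is needed for a non-linearly-separable problem.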
March 23
|
Practical considerations in deep learning: model initialization, overfitting versus underfitting, validation data. Architectures with convolutions. Advantages of convolutional architectures. Pooling and sub-sampling. Deep convolutional networks. Reference: Chapter 9 of Deep Learning
|
Slides
|
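Convolution plus pooling can be seen in one dimension with a few lines of NumPy. A minimal sketch; the input and the "edge-detecting" kernel are hypothetical examples, and, as in most deep-learning toolkits, "convolution" here is implemented as cross-correlation.

```python
import numpy as np

def conv1d_valid(x, k):
    """'Valid' 1-D convolution (cross-correlation, as deep-learning layers use)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def max_pool(x, size):
    """Non-overlapping max pooling: keep the largest activation per window."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.array([0., 1., 0., 0., 2., 0., 0., 1.])
edge = np.array([1., -1.])                  # hypothetical difference kernel
pooled = max_pool(conv1d_valid(x, edge), 2)
```

Pooling halves the feature-map length while keeping the strongest responses, which is the sub-sampling and translation-tolerance argument from the lecture.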
March 24
|
Back propagation learning with convolutional architectures and max pooling. Application to speech processing. Recurrent architectures and temporal sequence modeling. Reference: Chapter 10 of Deep Learning
|
Slides
|
March 27
|
Back-propagation learning in RNNs (BPTT algorithm). Vanishing gradients in RNNs. Introduction to long short-term memory (LSTM) networks. Architecture of the LSTM cell. Unsupervised learning. Formulation of Restricted Boltzmann Machines (RBMs). Reference: Chapter 10 of Deep Learning
|
Slides
|
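The LSTM cell's gating equations are compact enough to write out directly: input, forget, and output gates plus a candidate update, all computed from the concatenated input and previous hidden state. A forward-pass-only sketch with arbitrary small dimensions and random weights; training (BPTT through these equations) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell with stacked gate weights W."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = len(h_prev)
    i = sigmoid(z[:H])            # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:])        # candidate cell update
    c = f * c_prev + i * g        # gated memory update (the additive path
                                  # that mitigates vanishing gradients)
    h = o * np.tanh(c)            # gated output
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3                        # toy input and state sizes
W = 0.1 * rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                 # unroll over a short random input sequence
    h, c = lstm_cell(rng.standard_normal(D), h, c, W, b)
```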
March 29
|
Restricted Boltzmann Machine - energy function, proof of conditional independence. Gaussian Bernoulli RBM for modeling real visible units. RBMs as initialization for DNNs. Reference: Chapter 20 of Deep Learning
|
-
|
March 31
|
Parameter learning in RBMs, gradient descent. Need for sampling methods. Gibbs sampling and contrastive divergence. Intuition behind the positive and negative phases of the log-likelihood gradient. Reference: Chapter 18, 19 of Deep Learning
|
-
|
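Contrastive divergence (CD-1) replaces the intractable negative-phase expectation with a single Gibbs step from the data. A toy sketch on a trivial dataset (a single all-ones visible vector) so the update direction is easy to verify; the sizes, learning rate, and step count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.01 * rng.standard_normal((nv, nh))
bv, bh = np.zeros(nv), np.zeros(nh)
data = np.ones(nv)                         # toy training vector

lr = 0.1
for _ in range(500):
    # positive phase: visible units clamped to the data
    ph = sigmoid(data @ W + bh)
    h = (rng.random(nh) < ph).astype(float)
    # negative phase: one Gibbs step (CD-1) to get a model sample
    pv = sigmoid(W @ h + bv)
    v = (rng.random(nv) < pv).astype(float)
    ph_neg = sigmoid(v @ W + bh)
    # update: positive-phase statistics minus negative-phase statistics
    W += lr * (np.outer(data, ph) - np.outer(v, ph_neg))
    bv += lr * (data - v)
    bh += lr * (ph - ph_neg)

# mean-field reconstruction of the training vector after learning
recon = sigmoid(W @ sigmoid(data @ W + bh) + bv)
```

After training, the model reconstructs the training vector with high probability: the positive phase raises the probability of the data while the negative phase lowers the probability of the model's own samples.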
April 3
|
-
|
Midterm# 2
|
April 5
|
Application of RBMs for dimensionality reduction. Other generative models - autoencoders. Avoiding the identity mapping in autoencoders. Denoising autoencoders. Discussion of the assignment implementation of back-propagation learning.
|
Slides
|
April 10
|
Discussion of Mid-term Exam solutions
|
-
|
April 12
|
Time alignment and normalization for sequence comparison. Dynamic programming principle. Distance computation using warped temporal sequences. Meaningful constraints. Application of linear predictive coding and deep learning to speech coding, enhancement, and recognition. Reference: Rabiner and Juang, Fundamentals of Speech Recognition, Chapter 4.
|
Slides
|
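The dynamic-programming alignment reduces to one recursion: the cost of aligning prefixes (i, j) is the local distance plus the cheapest of the three predecessor cells. A minimal dynamic time warping sketch with the basic step pattern and no path constraints; the two integer sequences are toy examples, where the second is a time-stretched copy of the first.

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: minimum-cost monotone alignment of two sequences,
    filled in by the classic dynamic-programming recursion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            # predecessors: insertion, deletion, or diagonal match
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# a sequence aligns at zero cost with a time-stretched copy of itself
a = [1, 2, 3, 4, 3, 2]
b = [1, 1, 2, 3, 3, 4, 3, 2, 2]
cost = dtw(a, b)
```

The zero alignment cost despite the differing lengths is exactly the time-normalization property that motivated DTW for template-based speech recognition.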