Announcements:
|
January 1, 2017: First lecture will be held in EE B 303 on January 4, 2017 (Wednesday) at 3:30pm.
January 9, 2017: If you are attending this course (credit or audit) please send email to the instructors (with email subject E9_261_2017) indicating your name, SR No, and whether you are crediting or auditing.
January 11, 2017: The lecture of January 16, 2017 will be held in EE B 303 at 11:00am.
January 16, 2017: HW# 1 is due on January 23, 2017.
January 18, 2017: We need to reschedule the lecture of January 23, 2017. Please respond to the Doodle poll (sent through email) with your preference for the rescheduled lecture timing.
February 1, 2017: HW# 2 is due on February 8, 2017.
March 20, 2017: HW# 3 is due on March 24, 2017.
April 15, 2017: Midterm report (Problem definition, literature review, progress so far, future plan, Max 2 pages) for project is due on April 17, 2017.
April 20, 2017: The final examination will be held on April 24, 2017 in EE B 303 from 9:30am to 12:30pm.
April 20, 2017: The final project report (Introduction, Work done, Novelty, Data/Experimental setup, Summary, Max 4 pages) is due on May 2, 2017.
April 20, 2017: The final project slides should be emailed to the instructors on or before May 3, 2017.
April 20, 2017: The final project presentation (Max 6-8 slides, 10 min presentation) will be held on May 4, 2017 from 9:30am to 12:30pm in EE B 303.
|
Textbooks:
|
- Fundamentals of Speech Recognition, Lawrence R. Rabiner and Biing-Hwang Juang, Prentice Hall, 1993.
- Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014.
- Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, Prentice Hall, 2001.
- Digital Processing of Speech Signals, Lawrence R. Rabiner and Ronald W. Schafer, Pearson Education, 2008.
|
Topics covered:
|
Date
|
Topics
|
Remarks
|
Jan 4
|
Course logistics, information in speech, speech chain, speech research - science and technology
|
Introductory lecture
|
Jan 9
|
Phonemes, allophones, diphones, morphemes, lexicon, consonant clusters, IPA, ARPABET.
|
Notes
|
Jan 11
|
Summary of phonetics and phonology, manner and place of articulation, intonation, stress, co-articulation, assimilation, elision, speech production models, formants, human auditory system, auditory modeling, cochlear signal processing, speech perception theories, Fletcher-Munson curves, perceptual units of loudness.
|
Notes
|
Jan 16
|
Pitch perception, timbre, masking, critical bands, Bark scale, HRTF, distorted speech perception.
|
Notes HW# 1
|
Jan 18
|
Time-varying signal, time-varying system, temporal and frequency resolution, short-time Fourier transform (STFT), properties of STFT, inverse STFT.
|
Notes ST# 1
|
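The analysis equation above can be sketched directly: slide a window along the signal and take the DFT of each frame. A minimal NumPy sketch; the 440 Hz tone, 8 kHz sampling rate, 256-sample Hann window, and 128-sample hop are illustrative choices, not values from the lecture.

```python
import numpy as np

def stft(x, win, hop):
    """Short-time Fourier transform: window successive frames of x and DFT each one."""
    n = len(win)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

# toy input: one second of a 440 Hz tone at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

X = stft(x, np.hanning(256), hop=128)
peak_bin = np.argmax(np.abs(X[10]))   # strongest bin in one interior frame
peak_hz = peak_bin * fs / 256         # bin spacing is fs / window length
```

With a 256-point window the frequency resolution is 31.25 Hz, so the detected peak lands within one bin of the true 440 Hz, illustrating the temporal/frequency resolution trade-off discussed in the lecture.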
Jan 20
|
Filtering and filterbank interpretations of the STFT, filter-bank synthesis, and introduction to the overlap-add method.
|
Notes
|
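The overlap-add idea can be demonstrated in a few lines: window the signal into overlapping frames, then resynthesize by adding the windowed frames back and normalizing by the accumulated window envelope. A minimal sketch with an arbitrary window/hop choice; the normalization by the summed squared window is one common convention, not necessarily the one used in class.

```python
import numpy as np

def overlap_add(frames, win, hop):
    """Resynthesize a signal from windowed frames by overlap-add,
    normalizing by the accumulated squared-window envelope."""
    n = len(win)
    out_len = hop * (len(frames) - 1) + n
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for k, f in enumerate(frames):
        out[k * hop:k * hop + n] += f * win     # apply synthesis window and add
        norm[k * hop:k * hop + n] += win ** 2   # track total window weight
    return out / np.maximum(norm, 1e-12)

# analysis: window the signal into overlapping frames, then resynthesize
win, hop = np.hanning(256), 64
x = np.random.randn(2048)
frames = [x[i:i + 256] * win for i in range(0, len(x) - 256 + 1, hop)]
y = overlap_add(frames, win, hop)
```

Wherever the window envelope is nonzero the reconstruction is exact up to numerical error, which is the property that makes STFT-domain modification (denoising, time-scale modification) practical.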
Jan 30
|
Overlap-add method, reconstruction from STFT magnitude, wideband and narrowband spectrograms, spectrograms of different sounds -- vowels, fricatives, semivowels, nasals, stops -- spectrogram reading, formants, pattern playback, weighted overlap-add method, spectrogram reassignment, speech denoising, time-scale modification.
|
Notes
|
Feb 1
|
Time-frequency representations, time-bandwidth product, Gabor transform, time-frequency tiling, auditory filterbanks, auditory filter modeling, wavelet-based auditory filters, auditory models, time-varying parameters in speech, using Praat to estimate time-varying parameters.
|
Notes HW# 2 ST# 2
|
Feb 6
|
Homomorphic filtering, cepstrum, properties of the cepstrum, uniqueness of the cepstrum, motivation for separating the excitation and the vocal-tract response using the cepstrum.
|
Notes
|
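The separation of excitation and vocal-tract response can be seen numerically: the real cepstrum of a voiced-like signal has its filter contribution near quefrency zero and excitation peaks at multiples of the pitch period. A minimal sketch; the 200 Hz pitch, 8 kHz rate, and one-pole "vocal tract" are toy assumptions for illustration only.

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

# toy "voiced" frame: periodic impulse train through a one-pole filter
fs, f0, N = 8000, 200, 1200
n = np.arange(N)
excitation = (n % (fs // f0) == 0).astype(float)   # pitch period = 40 samples
x = np.zeros(N)
for i in range(N):                                  # x[i] = e[i] + 0.9 x[i-1]
    x[i] = excitation[i] + (0.9 * x[i - 1] if i else 0.0)

c = real_cepstrum(x)
# look for the excitation peak away from the low-quefrency (vocal tract) region
pitch_quef = np.argmax(c[20:200]) + 20
```

The peak appears at a multiple of 40 samples (the pitch period), so low-quefrency liftering keeps the smooth envelope while high-quefrency liftering keeps the excitation.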
Feb 8
|
Derivation of the cepstrum for a pole-zero transfer function, a periodic impulse train, and white noise; liftering, homomorphic vocoder, mel-frequency cepstral coefficients (MFCCs).
|
Notes
|
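The mel filterbank at the heart of MFCC computation is straightforward to construct: triangular filters spaced uniformly on the mel scale. A sketch using the common 2595·log10(1 + f/700) mel mapping; the 20 filters, 512-point FFT, and 8 kHz rate are illustrative parameters, not course-mandated values.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centers spaced uniformly on the mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)   # edge frequencies -> FFT bins
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

fb = mel_filterbank(20, 512, 8000)
# MFCCs then follow by taking log filterbank energies and a DCT; sketch ends here
```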
Feb 10
|
AM-FM model, non-linear models, signal subspace approach, sinusoidal model and its applications, chirp model, short-time chirp transform, mixture-Gaussian envelope chirp model, group delay analysis.
|
Notes
|
Feb 15
|
- Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?
- The evolution of the Lombard effect: 100 years of psychoacoustic research
- Reverberant speech enhancement using cepstral processing
- Enhancement of reverberant speech using LP residual signal
- Reverberant speech enhancement by temporal and spectral processing
- Joint dereverberation and noise reduction using beamforming and a single-channel speech enhancement scheme
- Acoustic characteristics related to the perceptual pitch in whispered vowels
- A comprehensive vowel space for whispered speech
- Fundamental frequency generation for whisper-to-audible speech conversion
- Silent communication: whispered speech-to-clear speech conversion
- Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease
- Seeing speech: capturing vocal tract shaping using real-time magnetic resonance imaging
- Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease
- Speech production knowledge in automatic speech recognition
- Knowledge from speech production used in speech technology: articulatory synthesis
- Speech production and speech modelling
|
|
Feb 20
|
Introduction to linear prediction (LP), LP as a filtering problem, orthogonality principle, optimal linear predictor, Yule-Walker equations, properties of the autocorrelation matrix, line spectral processes. Reference: Chapter 2 of Theory of Linear Prediction
|
-
|
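The optimal predictor follows directly from the Yule-Walker (normal) equations: estimate the autocorrelation, build the Toeplitz matrix, and solve. A minimal sketch on a synthetic AR(2) process with made-up coefficients (1.5, -0.7), chosen only so the recovered predictor can be checked against ground truth.

```python
import numpy as np

def lp_coefficients(x, order):
    """Solve the Yule-Walker equations R a = r for the optimal order-p
    linear predictor, using biased autocorrelation estimates."""
    N = len(x)
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(order + 1)]) / N
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1]), r

# synthetic AR(2) process: x[n] = 1.5 x[n-1] - 0.7 x[n-2] + w[n]
rng = np.random.default_rng(0)
w = rng.standard_normal(50000)
x = np.zeros_like(w)
for n in range(2, len(x)):
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + w[n]

a, r = lp_coefficients(x, 2)   # should recover approximately (1.5, -0.7)
```

Because the data really are AR(2), the order-2 predictor recovers the generating coefficients, which is the whitening property discussed later in the course.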
Feb 22
|
-
|
Midterm# 1
|
Feb 27
|
Relationship between the eigenvalues of the autocorrelation matrix and the power spectrum. Augmented normal equations, estimation of LP coefficients using the Levinson-Durbin recursion. Reflection coefficients. Reference: Chapter 3 of Theory of Linear Prediction
|
Notes
|
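The Levinson-Durbin recursion solves the Toeplitz normal equations in O(p^2) operations and yields the reflection coefficients as a by-product. A minimal sketch, checked against a direct linear solve; the autocorrelation sequence below is a toy positive-definite example, not data from the course.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: order-recursive solution of the normal
    equations, returning predictor coefficients a, reflection (PARCOR)
    coefficients k, and the final prediction error."""
    a = np.zeros(order)
    k = np.zeros(order)
    err = r[0]
    for m in range(order):
        # partial correlation of the new sample with the current residual
        lam = (r[m + 1] - np.dot(a[:m], r[1:m + 1][::-1])) / err
        k[m] = lam
        new = a.copy()
        new[m] = lam
        new[:m] = a[:m] - lam * a[:m][::-1]   # order-update of the predictor
        a = new
        err *= 1.0 - lam ** 2                 # error shrinks by (1 - k_m^2)
    return a, k, err

# check against a direct solve of the same normal equations
r = np.array([1.0, 0.5, 0.2, 0.05])           # toy positive-definite autocorrelation
a_ld, k, err = levinson_durbin(r, 3)
R = np.array([[r[abs(i - j)] for j in range(3)] for i in range(3)])
a_direct = np.linalg.solve(R, r[1:4])
```

All reflection coefficients have magnitude below one, which is the positive-definiteness check mentioned in the lecture, and the recursion agrees with the direct solve.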
March 1
|
Stalling of the prediction error and spectral flatness, definition of an autoregressive (AR) process, AR processes and their relationship with linear prediction. Error whitening. AR approximation of a wide-sense stationary sequence. Spectral estimation using AR modeling. Reference: Chapter 5 of Theory of Linear Prediction
|
ST# 3
|
March 13
|
Summary of autocorrelation matching property, AR modeling of speech spectra, Moving average and ARMA model definition. Perceptual linear predictive (PLP) analysis of speech. Basics of pattern recognition and learning. Reference: Chapter 5 of Theory of Linear Prediction, Chapter 4, 5 of Deep Learning
|
-
|
March 13
|
Supervised versus unsupervised learning. Optimization in supervised learning - Gradient descent and stochastic gradient descent. One layer neural network - properties and limitations. Need for hidden layer activations. Choice of non-linearities. Reference: Chapter 6 of Deep Learning
|
-
|
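Gradient descent and its stochastic variant can be contrasted on the simplest supervised problem, least-squares regression: full-batch descent uses the exact gradient, SGD uses one random example per step. A minimal sketch; the data, true weights, learning rates, and iteration counts are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.standard_normal(500)

# full-batch gradient descent on the mean squared error
w = np.zeros(3)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # exact gradient over all examples
    w -= 0.1 * grad

# stochastic gradient descent: one randomly chosen example per update
w_sgd = np.zeros(3)
for _ in range(5000):
    i = rng.integers(len(y))
    g = 2 * X[i] * (X[i] @ w_sgd - y[i])    # noisy single-example gradient
    w_sgd -= 0.01 * g
```

Both recover the generating weights; SGD needs a smaller step size because its gradient estimate is noisy, which is the batch-versus-stochastic trade-off from the lecture.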
March 17
|
Cost function for neural network optimization. Neural networks estimate Bayesian posterior probabilities. Universal approximation properties of NNs. Need for a deep hierarchical learning paradigm for complex data. Back-propagation learning for deep neural networks. Reference: Chapter 6 of Deep Learning
|
Slides
Reference Paper
Reference Book - "Automatic Speech Recognition - Deep Learning Approach", Dong Yu, Li Deng
|
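Back-propagation for a one-hidden-layer network fits in a short script, and XOR is the classic problem a single layer cannot solve. A minimal sketch with cross-entropy loss (so the output delta is simply y - t); the hidden width, learning rate, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0], [1], [1], [0]], float)       # XOR targets

H = 8                                           # hidden width (a free choice)
W1 = rng.standard_normal((2, H)); b1 = np.zeros(H)
W2 = rng.standard_normal((H, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(2000):
    # forward pass: tanh hidden layer, sigmoid output
    h = np.tanh(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    losses.append(-np.mean(t * np.log(y + 1e-12) + (1 - t) * np.log(1 - y + 1e-12)))
    # backward pass: cross-entropy with sigmoid gives output delta (y - t)
    d2 = (y - t) / len(X)
    dW2, db2 = h.T @ d2, d2.sum(0)
    d1 = (d2 @ W2.T) * (1 - h ** 2)             # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ d1, d1.sum(0)
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g                            # gradient-descent update
```

The loss falls steadily, demonstrating both the chain-rule gradient flow and why a hidden layer is needed for a non-linearly-separable problem.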
March 23
|
Practical considerations in deep learning: model initialization, overfitting versus underfitting, validation data. Architectures with convolutions. Advantages of convolutional architectures. Pooling and sub-sampling. Deep convolutional networks. Reference: Chapter 9 of Deep Learning
|
Slides
|
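Convolution plus pooling can be seen in one dimension with a few lines of NumPy. A minimal sketch; the input and the "edge-detecting" kernel are hypothetical examples, and, as in most deep-learning toolkits, "convolution" here is implemented as cross-correlation.

```python
import numpy as np

def conv1d_valid(x, k):
    """'Valid' 1-D convolution (cross-correlation, as deep-learning layers use)."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def max_pool(x, size):
    """Non-overlapping max pooling: keep the largest activation per window."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

x = np.array([0., 1., 0., 0., 2., 0., 0., 1.])
edge = np.array([1., -1.])                  # hypothetical difference kernel
pooled = max_pool(conv1d_valid(x, edge), 2)
```

Pooling halves the feature-map length while keeping the strongest responses, which is the sub-sampling and translation-tolerance argument from the lecture.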
March 24
|
Back propagation learning with convolutional architectures and max pooling. Application to speech processing. Recurrent architectures and temporal sequence modeling. Reference: Chapter 10 of Deep Learning
|
Slides
|
March 27
|
Back-propagation learning in RNNs (BPTT algorithm). Vanishing gradients in RNNs. Introduction to long short-term memory (LSTM) networks. Architecture of the LSTM cell. Unsupervised learning. Formulation of Restricted Boltzmann Machines (RBMs). Reference: Chapter 10 of Deep Learning
|
Slides
|
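The LSTM cell's gating equations are compact enough to write out directly: input, forget, and output gates plus a candidate update, all computed from the concatenated input and previous hidden state. A forward-pass-only sketch with arbitrary small dimensions and random weights; training (BPTT through these equations) is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell with stacked gate weights W."""
    z = W @ np.concatenate([x, h_prev]) + b
    H = len(h_prev)
    i = sigmoid(z[:H])            # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:])        # candidate cell update
    c = f * c_prev + i * g        # gated memory update (the additive path
                                  # that mitigates vanishing gradients)
    h = o * np.tanh(c)            # gated output
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3                        # toy input and state sizes
W = 0.1 * rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(5):                 # unroll over a short random input sequence
    h, c = lstm_cell(rng.standard_normal(D), h, c, W, b)
```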
March 29
|
Restricted Boltzmann Machine - energy function, proof of conditional independence. Gaussian Bernoulli RBM for modeling real visible units. RBMs as initialization for DNNs. Reference: Chapter 20 of Deep Learning
|
-
|
March 31
|
Parameter learning in RBMs, gradient descent. Need for sampling methods. Gibbs sampling and contrastive divergence. Intuition behind the positive and negative phases of the log-likelihood gradient. Reference: Chapter 18, 19 of Deep Learning
|
-
|
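Contrastive divergence (CD-1) replaces the intractable negative-phase expectation with a single Gibbs step from the data. A toy sketch on a trivial dataset (a single all-ones visible vector) so the update direction is easy to verify; the sizes, learning rate, and step count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
nv, nh = 6, 4
W = 0.01 * rng.standard_normal((nv, nh))
bv, bh = np.zeros(nv), np.zeros(nh)
data = np.ones(nv)                         # toy training vector

lr = 0.1
for _ in range(500):
    # positive phase: visible units clamped to the data
    ph = sigmoid(data @ W + bh)
    h = (rng.random(nh) < ph).astype(float)
    # negative phase: one Gibbs step (CD-1) to get a model sample
    pv = sigmoid(W @ h + bv)
    v = (rng.random(nv) < pv).astype(float)
    ph_neg = sigmoid(v @ W + bh)
    # update: positive-phase statistics minus negative-phase statistics
    W += lr * (np.outer(data, ph) - np.outer(v, ph_neg))
    bv += lr * (data - v)
    bh += lr * (ph - ph_neg)

# mean-field reconstruction of the training vector after learning
recon = sigmoid(W @ sigmoid(data @ W + bh) + bv)
```

After training, the model reconstructs the training vector with high probability: the positive phase raises the probability of the data while the negative phase lowers the probability of the model's own samples.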
April 3
|
-
|
Midterm# 2
|
April 5
|
Application of RBMs for dimensionality reduction. Other generative models - autoencoders. Avoiding the identity mapping in autoencoders. Denoising autoencoders. Discussion of the assignment implementation of back-propagation learning.
|
Slides
|
April 10
|
Discussion of Mid-term Exam solutions
|
-
|
April 12
|
Time alignment and normalization for sequence comparison. Dynamic programming principle. Distance computation using warped temporal sequences. Meaningful constraints. Application of linear predictive coding and deep learning to speech coding, enhancement, and recognition. Reference: Rabiner and Juang, Fundamentals of Speech Recognition, Chapter 4.
|
Slides
|
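The dynamic-programming alignment reduces to one recursion: the cost of aligning prefixes (i, j) is the local distance plus the cheapest of the three predecessor cells. A minimal dynamic time warping sketch with the basic step pattern and no path constraints; the two integer sequences are toy examples, where the second is a time-stretched copy of the first.

```python
import numpy as np

def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping: minimum-cost monotone alignment of two sequences,
    filled in by the classic dynamic-programming recursion."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(x[i - 1], y[j - 1])
            # predecessors: insertion, deletion, or diagonal match
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# a sequence aligns at zero cost with a time-stretched copy of itself
a = [1, 2, 3, 4, 3, 2]
b = [1, 1, 2, 3, 3, 4, 3, 2, 2]
cost = dtw(a, b)
```

The zero alignment cost despite the differing lengths is exactly the time-normalization property that motivated DTW for template-based speech recognition.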