Announcements:
|
January 1, 2018: First lecture will be held in EE B 303 on January 3, 2018 (Wednesday) at 3:30pm.
January 3, 2018: Please email the instructors before January 14, 2018 if you are interested in attending the course (for credit or audit).
January 16, 2018: HW#1 is due on January 22, 2018 (Monday).
January 29, 2018: HW#2 is due on February 5, 2018 (Monday).
February 14, 2018: Midterm#1 is on February 21, 2018 (Wednesday).
April 1, 2018: The project proposal is due on April 4, 2018: one paragraph on your chosen topic and reference paper(s). The project is expected to consist of reproducing the results of the reference paper (70%) and a novel contribution (30%).
April 1, 2018: The final project report (Introduction, Work done, Novelty, Data/Experimental setup, Summary, Max 4 pages) is due on May 2, 2018.
April 1, 2018: The final examination will be held in the week of May 1-5; more details will be communicated later.
April 19, 2018: The final examination will be held on April 24, Tuesday at 2pm.
April 19, 2018: The course project presentation is scheduled on May 4 at 2pm.
|
Textbooks:
|
- Fundamentals of Speech Recognition, Rabiner and Juang, Prentice Hall, 1993.
- Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014.
- Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, Prentice Hall, 2001.
- Digital Processing of Speech Signals, Lawrence R. Rabiner and Ronald W. Schafer, Pearson Education, 2008.
|
Topics covered:
|
Date
|
Topics
|
Remarks
|
Jan 3
|
Course logistics
|
-
|
Jan 8
|
Information in speech, speech chain, speech research - science and technology
|
Introductory lecture
|
Jan 10
|
Phonemes, allophones, diphones, morphemes, lexicon, consonant clusters.
|
IPA, ARPABET
|
Jan 15
|
Summary of phonetics and phonology, manner and place of articulation, intonation, stress, co-articulation, assimilation, elision, speech production models, formants, human auditory system, auditory modeling, cochlear signal processing.
|
Notes
|
Jan 17
|
Speech perception theories, Fletcher-Munson curves, perceptual unit of loudness, pitch perception, timbre, masking, critical band, Bark scale, HRTF, categorical perception.
|
Notes
|
Jan 24
|
McGurk Effect, distorted speech perception, Time-varying signal, time-varying system, temporal and frequency resolution, short-time Fourier transform (STFT), properties of STFT, inverse STFT.
|
Notes
|
Jan 29
|
Filtering and Filterbank Interpretation of STFT, Filter Bank Synthesis.
|
Notes
|
Jan 31
|
Overlap-add method, reconstruction from STFT magnitude, wideband and narrowband spectrograms, spectrograms of different sounds: vowel, fricative, semivowel, nasal, stops.
|
Notes
|
Feb 5
|
Spectrogram reading, formants, pattern playback, weighted overlap-add method, spectrogram reassignment, speech denoising, time-scale modification.
|
Notes ST# 1
|
Feb 7
|
Time-frequency representation, time-bandwidth product, Gabor transform, time-frequency tile, auditory filterbank, auditory filter modeling, wavelet based auditory filter, auditory model.
|
Notes
|
Feb 12
|
Homomorphic filtering, cepstrum, properties of the cepstrum, uniqueness of the cepstrum, motivation for extracting the excitation and the vocal tract response using the cepstrum.
|
Notes
|
Feb 14
|
Derivation of the cepstrum for a pole-zero transfer function, periodic impulse train, white noise, liftering, homomorphic vocoder, mel-frequency cepstral coefficients.
|
Notes ST# 2
|
Feb 19
|
AM-FM model, non-linear models, signal subspace approach, Sinusoidal model, its applications, Chirp model, short-time chirp transform, mixture Gaussian envelope chirp model, group delay analysis
Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?
The evolution of the Lombard effect: 100 years of psychoacoustic research
Reverberant Speech Enhancement Using Cepstral Processing
Enhancement of Reverberant Speech Using LP Residual Signal
Reverberant Speech Enhancement by Temporal and Spectral Processing
Joint Dereverberation and Noise Reduction Using Beamforming and a Single-Channel Speech Enhancement Scheme
Acoustic characteristics related to the perceptual pitch in whispered vowels
A Comprehensive Vowel Space for Whispered Speech
Fundamental Frequency Generation for Whisper-to-Audible Speech Conversion
Silent Communication: whispered speech-to-clear speech conversion
Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease
Seeing Speech: Capturing Vocal Tract Shaping Using Real-Time Magnetic Resonance Imaging
Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease
Speech production knowledge in automatic speech recognition
Knowledge from Speech Production Used in Speech Technology: Articulatory Synthesis
Speech Production and Speech Modelling
|
Notes
|
Feb 21
|
-
|
Midterm# 1
|
Feb 26
|
Introduction to linear prediction (LP), LP as a filtering problem, orthogonality principle, optimal linear predictor, Yule-Walker equations, properties of the autocorrelation matrix. Reference: Chapter 2 of Theory of Linear Prediction
|
Notes
|
Feb 28
|
Relationship between eigenvalues of autocorrelation matrix and power spectrum. Augmented normal equations, Line Spectral Processes. Reference: Chapter 2 of Theory of Linear Prediction
|
-
|
Mar 5
|
Estimation of LP coefficients using the Levinson-Durbin recursion. Reflection coefficients. Properties of error stalling. Definition of AR processes. Reference: Chapters 3 and 5 of Theory of Linear Prediction
|
-
|
Mar 7
|
Stalling error in linear prediction and spectral flatness, Autoregressive process definition, AR process and relationship with linear prediction. Error whitening. AR approximation of a wide sense stationary sequence. Spectral estimation using AR modeling. Reference: Chapter 5 of Theory of Linear Prediction
|
-
|
Mar 12
|
Time alignment and normalization of two sequences of variable length. Dynamic programming principles: recursive optimization in sequential problems. Introduction to dynamic time warping. Reference: Rabiner and Juang, Fundamentals of Speech Recognition, Chapter 4.
|
Notes
|
Mar 14
|
Dynamic time warping (DTW): end-point constraints, local and global constraints. Optimization algorithm for DTW. Applications of DTW in speech signal processing. Reference: Rabiner and Juang, Fundamentals of Speech Recognition, Chapter 4.
|
-
|
Mar 21
|
Introduction to Hidden Markov Models. Definition of HMM. Three problems in HMM. Likelihood computation: brute force and the forward-backward algorithm. Problem of state alignment: Viterbi decoding.
|
Notes
|
Mar 26
|
HMM training. Using Gaussian distribution in HMM states. HMM-DNN modeling. Supervised and Unsupervised learning. Multi-layer perceptrons. Hidden layer activations and output layer non-linearities.
|
-
|
Mar 28
|
Backpropagation in DNNs. Posterior probability estimation in DNNs. HMM-DNN hybrid modeling. Reference: Neural Networks for Pattern Recognition by C. Bishop, 2007, Chapter 4.
|
-
|
Apr 2
|
Non-emitting states in HMMs. Connecting word HMMs. Sequence of HMMs. Decoding with connected words. Reference: http://www.speech.cs.cmu.edu/sphinxman/HMM.pdf
|
Notes
|
Apr 4
|
Language modeling - n-gram and backoff. Recurrent neural networks, feedback and recursive properties. Backpropagation in RNNs. Various recurrent architectures. Reference: "Deep Learning", Ian Goodfellow. Chapter 10.
|
Notes
|
Apr 6
|
Long short-term memory (LSTM) networks. Need for end-to-end models. Sequence labeling with connectionist temporal classification (CTC). Reference: Chapters 4 and 6 of Supervised Sequence Labelling with Recurrent Neural Networks, Alex Graves
|
-
|
|
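Illustration for the Mar 12 and Mar 14 lectures: a minimal sketch of the dynamic programming recursion behind dynamic time warping. It is not taken from the course notes. The function name `dtw_distance`, the absolute-difference local cost, and the three-way local constraint (horizontal, vertical, diagonal step) are assumptions chosen for brevity; no global constraint (such as a band around the diagonal) is applied.

```python
import math


def dtw_distance(x, y):
    """Accumulated DTW cost between two 1-D sequences (illustrative sketch)."""
    n, m = len(x), len(y)
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    # End-point constraint: every path starts by pairing the first samples
    # and ends by pairing the last samples.
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local distance: absolute difference of the samples.
            local = abs(x[i - 1] - y[j - 1])
            # Assumed local constraint: reach (i, j) from the cell above,
            # the cell to the left, or the diagonal neighbour.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` evaluates to `0.0`, because the repeated `2` in the second sequence can be absorbed by warping the time axis. For speech, the scalar samples would typically be replaced by feature vectors with a vector distance as the local cost.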