E9 261 (JAN) 3:1 Speech Information Processing
January-April, 2018

Announcements:
January 1, 2018: First lecture will be held in EE B 303 on January 3, 2018 (Wednesday) at 3:30pm.
January 3, 2018: Please send email to the instructors (before January 14, 2018) if you are interested in attending the course (credit or audit).
January 16, 2018: HW#1 is due on January 22, 2018 (Monday).
January 29, 2018: HW#2 is due on February 5, 2018 (Monday).
February 14, 2018: Midterm#1 is on February 21, 2018 (Wednesday).
April 1, 2018: Project proposal submission deadline: April 4, 2018 - one paragraph on your chosen topic and reference paper(s). The project is expected to include a reproduction of the reference paper's results (70%) and novel contributions (30%).
April 1, 2018: The final project report (Introduction, Work done, Novelty, Data/Experimental setup, Summary, Max 4 pages) is due on May 2, 2018.
April 1, 2018: The final examination will be held in the week of May 1-5; more details will be communicated later.
April 19, 2018: The final examination will be held on April 24, Tuesday at 2pm.
April 19, 2018: The course project presentation is scheduled on May 4 at 2pm.


Instructor:
Prasanta Kumar Ghosh
Office: EE C 330
Phone: +91 (80) 2293 2694
prasantg AT iisc.ac.in


Sriram Ganapathy
Office: EE C 334
Phone: +91 (80) 2293 2433
sriramg AT iisc.ac.in


Teaching Assistant(s):
  • Purvi Agrawal
    Office: EE C 328
    Phone:
    purvia AT iisc.ac.in


Class meetings:
3:30pm to 5:00pm every Monday and Wednesday (Venue: EE B 303)


Course Content:
  • Speech communication and overview
  • Time-varying signals and systems
  • Spectrograms and applications
  • Speech parameterization/representation
  • AM-FM, sinusoidal models for speech
  • AR, ARMA, time-varying AR models for speech
  • Deep learning for speech
  • Speech applications - recognition, enhancement and coding


Prerequisites:
Digital Signal Processing, Probability and Random Processes


Textbooks:
    • Fundamentals of Speech Recognition, Lawrence R. Rabiner and Biing-Hwang Juang, Prentice Hall, 1993.
    • Automatic Speech Recognition: A Deep Learning Approach, Dong Yu and Li Deng, Springer, 2014.
    • Discrete-Time Speech Signal Processing: Principles and Practice, Thomas F. Quatieri, Prentice Hall, 2001.
    • Digital Processing of Speech Signals, Lawrence R. Rabiner and Ronald W. Schafer, Pearson Education, 2008.


Web Links:
The Edinburgh Speech Tools Library
Speech Signal Processing Toolkit (SPTK)
Hidden Markov Model Toolkit (HTK)
ICSI Speech Group Tools
VOICEBOX: Speech Processing Toolbox for MATLAB
Praat: doing phonetics by computer
Audacity
SoX - Sound eXchange
HMM-based Speech Synthesis System (HTS)
International Phonetic Association (IPA)
Type IPA phonetic symbols
CMU dictionary
Co-articulation and phonology by Ohala
Assisted Listening Using a Headset
Headphone-Based Spatial Sound
Pitch Perception
Head-Related Transfer Functions and Virtual Auditory Display
Signal reconstruction from STFT magnitude: a state of the art
On the usefulness of STFT phase spectrum in human listening tests
Experimental comparison between stationary and nonstationary formulations of linear prediction applied to voiced speech analysis
A modified autocorrelation method of linear prediction for pitch-synchronous analysis of voiced speech
Linear prediction: A tutorial review
Energy separation in signal modulations with application to speech analysis
Nonlinear Speech Modeling and Applications


Grading:
  • Surprise exams (5 points) - 4 surprise exams, 10 minutes each, 5 points each; the average of the four will be counted. Missed exams earn 0 points and there are no make-up exams. Class attendance is mandatory; an unexcused absence earns a score of zero for that session's exam.
  • Assignments (5 points) - 6 assignments; the average of all assignments will be counted. Assignments are meant for learning and preparation for exams. Students may discuss homework problems among themselves, but each student must do his or her own work; turning in identical homework sets counts as cheating. Cheating or otherwise violating academic integrity (see below) will result in failing the course.
  • Midterm exams (20 points) - 2 midterm exams; the average of the two scores will be counted. Missed exams earn 0 points and there are no make-up exams.
  • Final exam. (50 points)
  • Project (20 points) - Quality/Quantity of work (5 points), Report (5 points), Presentation (5 points), Recording (5 points).


Topics covered:
Date
Topics
Remarks
Jan 3
Course logistics
-
Jan 8
Information in speech, speech chain, speech research - science and technology
Introductory lecture
Jan 10
Phonemes, allophones, diphones, morphemes, lexicon, consonant cluster.
IPA, ARPABET
Jan 15
Summary of phonetics and phonology, manner and place of articulation, intonation, stress, co-articulation, assimilation, elision, speech production models, formants, human auditory system, auditory modeling, cochlear signal processing.
Notes
Jan 17
Speech perception theories, Fletcher-Munson curves, perceptual unit of loudness, pitch perception, timbre, masking, critical bands, Bark scale, HRTF, categorical perception.
Notes
Jan 24
McGurk Effect, distorted speech perception, Time-varying signal, time-varying system, temporal and frequency resolution, short-time Fourier transform (STFT), properties of STFT, inverse STFT.
Notes
Jan 29
Filtering and Filterbank Interpretation of STFT, Filter Bank Synthesis.
Notes
Jan 31
Overlap Add method, reconstruction from STFT magnitude, Wideband and Narrowband spectrogram, Spectrograms of different sounds -- vowel, fricative, semivowel, nasal, stops.
Notes
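The analysis-synthesis chain covered here (STFT followed by weighted overlap-add) can be sketched with SciPy. A minimal sketch, assuming illustrative parameter choices (16 kHz sampling, 25 ms Hann frames at 50% overlap, which satisfy the constant-overlap-add condition that OLA synthesis relies on):

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(fs) / fs                  # 1 second of signal
x = np.sin(2 * np.pi * 440 * t)         # a 440 Hz test tone

nperseg = 400                           # 25 ms frames at 16 kHz
f, frames, Zxx = signal.stft(x, fs=fs, window='hann',
                             nperseg=nperseg, noverlap=nperseg // 2)

# |Zxx| is the spectrogram magnitude (narrowband vs. wideband is
# controlled by the frame length); log-compress for display.
spec_db = 20 * np.log10(np.abs(Zxx) + 1e-10)

# Inverse STFT = weighted overlap-add synthesis.
_, x_rec = signal.istft(Zxx, fs=fs, window='hann',
                        nperseg=nperseg, noverlap=nperseg // 2)

# Reconstruction error is near machine precision for a COLA window.
m = min(len(x), len(x_rec))
err = np.max(np.abs(x[:m] - x_rec[:m]))
```

Shortening `nperseg` trades frequency resolution for time resolution, turning the narrowband spectrogram into a wideband one.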
Feb 5
Spectrogram reading, formants, pattern playback, weighted overlap-add method, spectrogram re-assignment, speech denoising, time-scale modification.
Notes
ST# 1
Feb 7
Time-frequency representation, time-bandwidth product, Gabor transform, time-frequency tile, auditory filterbank, auditory filter modeling, wavelet based auditory filter, auditory model.
Notes
Feb 12
Homomorphic filtering, cepstrum, properties of the cepstrum, uniqueness of the cepstrum, motivation for separating the excitation and vocal tract response using the cepstrum.
Notes
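The excitation/vocal-tract separation that motivates the cepstrum can be seen on a toy source-filter signal. A minimal sketch, where the pitch period, one-pole "vocal tract", and frame length are illustrative choices:

```python
import numpy as np

# Toy source-filter signal: an impulse train (excitation) through a
# one-pole filter (a crude vocal tract). P and the pole are assumptions.
P = 80                                   # pitch period in samples
N = 1600                                 # exactly 20 pitch periods
excitation = (np.arange(N) % P == 0).astype(float)
x = np.zeros(N)
for i in range(N):                       # x[n] = e[n] + 0.9 x[n-1]
    x[i] = excitation[i] + (0.9 * x[i - 1] if i > 0 else 0.0)

# Real cepstrum: inverse DFT of the log magnitude spectrum.
X = np.fft.fft(x)
c = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real

# Low quefrencies hold the smooth envelope (vocal tract); the
# excitation appears as a strong peak at the pitch period P.
pitch_peak = np.argmax(c[P // 2: 2 * P]) + P // 2
```

Liftering amounts to keeping only the low-quefrency samples of `c` (envelope) or only the region around the pitch peak (excitation) before transforming back.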
Feb 14
Derivation of the cepstrum for a rational (pole-zero) transfer function, periodic impulse train, white noise, liftering, homomorphic vocoder, mel-frequency cepstral coefficients.
Notes
ST# 2
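The mel warping behind mel-frequency cepstral coefficients can be written in two lines. A sketch using the common 2595/700 (O'Shaughnessy) formula; the 10-band layout over 0-8 kHz is an illustrative choice:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula: ~linear below 1 kHz, ~log above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Triangular-filter edges are spaced uniformly in mel, so their
# spacing in Hz widens with frequency, mimicking critical bands.
edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12)
centers_hz = mel_to_hz(edges_mel[1:-1])
```

MFCCs are then the low-order DCT coefficients of the log energies collected in these bands.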
Feb 19
AM-FM model, non-linear models, signal subspace approach, sinusoidal model and its applications, chirp model, short-time chirp transform, mixture Gaussian envelope chirp model, group delay analysis.
References:
  • Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise?
  • The evolution of the Lombard effect: 100 years of psychoacoustic research
  • Reverberant speech enhancement using cepstral processing
  • Enhancement of reverberant speech using LP residual signal
  • Reverberant speech enhancement by temporal and spectral processing
  • Joint dereverberation and noise reduction using beamforming and a single-channel speech enhancement scheme
  • Acoustic characteristics related to the perceptual pitch in whispered vowels
  • A comprehensive vowel space for whispered speech
  • Fundamental frequency generation for whisper-to-audible speech conversion
  • Silent communication: whispered speech-to-clear speech conversion
  • Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease
  • Seeing speech: capturing vocal tract shaping using real-time magnetic resonance imaging
  • Speech production, syntax comprehension, and cognitive deficits in Parkinson's disease
  • Speech production knowledge in automatic speech recognition
  • Knowledge from speech production used in speech technology: articulatory synthesis
  • Speech production and speech modelling
Notes
Feb 21
-
Midterm# 1
Feb 26
Introduction to linear prediction (LP), LP as a filtering problem, orthogonality principle, optimal linear predictor, Yule Walker equations, Properties of Autocorrelation matrix. Reference: Chapter 2 of Theory of Linear Prediction
Notes
Feb 28
Relationship between eigenvalues of autocorrelation matrix and power spectrum. Augmented normal equations, Line Spectral Processes. Reference: Chapter 2 of Theory of Linear Prediction
-
Mar 5
Estimation of LP coefficients using the Levinson-Durbin recursion. Reflection coefficients. Properties of error stalling. Definition of AR processes. Reference: Chapters 3 and 5 of Theory of Linear Prediction
-
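The Levinson-Durbin recursion itself fits in a few lines. A sketch using the error-filter convention A(z) = 1 + a_1 z^{-1} + ... + a_p z^{-p} (so the predictor taps are -a_k), which is one common choice:

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations by Levinson-Durbin recursion.
    r[0..order] are autocorrelation lags; returns the error-filter
    coefficients a (a[0] = 1) and the final prediction error."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient k_m for the order-m update.
        k = -np.dot(a[:m], r[m:0:-1]) / err
        a_prev = a.copy()
        for i in range(1, m):
            a[i] = a_prev[i] + k * a_prev[m - i]
        a[m] = k
        err *= (1.0 - k * k)      # the error "stalls" when k_m = 0
    return a, err

# Autocorrelation of an AR(1) process x[n] = 0.5 x[n-1] + e[n]:
r = np.array([1.0, 0.5, 0.25])
a, e = levinson_durbin(r, 2)      # recovers a = [1, -0.5, 0]
```

Because the true model is order 1, the second reflection coefficient is zero and the prediction error stalls at the order-1 value, illustrating the error-stalling property above.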
Mar 7
Stalling error in linear prediction and spectral flatness, definition of autoregressive (AR) processes, relationship between AR processes and linear prediction, error whitening, AR approximation of a wide-sense stationary sequence, spectral estimation using AR modeling. Reference: Chapter 5 of Theory of Linear Prediction
-
Mar 12
Time Alignment and normalization of two sequences of variable length. Dynamic Programming Principles - recursive optimization in sequential problems. Introduction to Dynamic time warping. Reference: Rabiner and Juang, Speech Recognition Text book, Chapter 4.
Notes
Mar 14
Dynamic Time Warping - End point constraints, local and global constraints. Optimization algorithm for DTW. Applications of DTW for speech signal processing. Reference: Rabiner and Juang, Speech Recognition Text book, Chapter 4.
-
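The DTW optimization above reduces to one dynamic-programming recurrence. A minimal sketch with the basic symmetric local constraint (horizontal, vertical, diagonal steps), Euclidean frame distance, and fixed end points; real systems add global path constraints and slope weights:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between two sequences of frames."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0                 # end-point constraint: paths start at (1, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.atleast_1d(x[i - 1]).astype(float) -
                                  np.atleast_1d(y[j - 1]).astype(float))
            # Local continuity constraint: extend from the cheapest of
            # the three admissible predecessors.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]                # end-point constraint: paths end at (n, m)

# Warping absorbs the repeated frame, so the distance is zero:
d = dtw([1, 2, 3], [1, 2, 2, 3])
```

For isolated-word template matching, the same routine is applied between a test utterance's feature frames and each stored template, and the smallest distance wins.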
Mar 21
Introduction to Hidden Markov Models. Definition of HMM. Three Problems in HMM. Likelihood estimation - brute force and forward/backward modeling. Problem of state alignment - Viterbi decoding.
Notes
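Of the three HMM problems, the likelihood computation is solved by the forward algorithm, which replaces the brute-force sum over all state paths with a recursion. A sketch for a discrete-observation HMM; the 2-state model at the bottom is a hypothetical example:

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: total likelihood P(obs | model).
    pi: (S,) initial state probabilities
    A:  (S, S) transitions, A[i, j] = P(next = j | current = i)
    B:  (S, V) emissions, B[s, v] = P(symbol v | state s)
    obs: sequence of observation symbol indices."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction: sum over predecessors
    return alpha.sum()                 # termination

# Hypothetical 2-state, 2-symbol model for illustration:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
p = forward(pi, A, B, [0, 1])
```

Replacing the sum in the induction step with a max (and keeping backpointers) turns this into Viterbi decoding for the state-alignment problem.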
Mar 26
HMM training. Using Gaussian distribution in HMM states. HMM-DNN modeling. Supervised and Unsupervised learning. Multi-layer perceptrons. Hidden layer activations and output layer non-linearities.
-
Mar 28
Backpropagation in DNNs. Posterior probability estimation in DNNs. HMM-DNN hybrid modeling. Reference: Neural Networks and Pattern Recognition by C. Bishop, 2007, Chapter 4.
-
Apr 2
Non-emitting states in HMMs. Connecting word HMMs. Sequence of HMMs. Decoding with connected words. Reference: http://www.speech.cs.cmu.edu/sphinxman/HMM.pdf
Notes
Apr 4
Language modeling - n-gram and backoff. Recurrent neural networks, feedback and recursive properties. Backpropagation in RNNs. Various recurrent architectures. Reference: "Deep Learning", Ian Goodfellow. Chapter 10.
Notes
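The n-gram-with-backoff idea can be sketched on a toy bigram model. Here unseen bigrams fall back smoothly to the unigram estimate via linear interpolation (one simple backoff scheme); the corpus and the weight `lam` are illustrative, not tuned:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
lam = 0.7   # interpolation weight for the bigram term (assumed)

def prob(word, prev):
    """P(word | prev) = lam * P_bigram + (1 - lam) * P_unigram,
    so unseen bigrams back off to the unigram estimate."""
    p_uni = unigrams[word] / N
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

p_seen = prob("cat", "the")     # "the cat" occurs in the corpus
p_backoff = prob("cat", "sat")  # "sat cat" is unseen: unigram term only
```

Production language models use count-based discounting (e.g. Katz backoff) or, as in the RNN lecture above, learn the distribution directly.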
Apr 6
Long short term memory (LSTM) networks. Need for end-to-end models. Sequence labeling with connectionist temporal classification (CTC). Reference: Chapter 4, 6 of Supervised sequence labeling with Recurrent Networks, Alex Graves
-




Academic Honesty:
As students of IISc, we expect you to adhere to the highest standards of academic honesty and integrity.
Please read the IISc academic integrity policy.