Department of Electrical Engineering / Columbia University

# Speech Recognition

## Overview

The first portion of the course will cover fundamental topics in speech recognition: signal processing, Gaussian mixture distributions, the Expectation-Maximization algorithm, deep neural networks, hidden Markov models, pronunciation modeling, decision trees, language modeling, finite-state transducers, and search. Topics will be covered in sufficient detail for students to be able to implement a basic large vocabulary speech recognizer.

In the remainder of the course, selected topics from the current state of the art will be discussed. We will cover several key areas in more depth and survey some advanced topics, including acoustic adaptation, discriminative training, and maximum entropy models.

## Prerequisites

The course assumes a knowledge of basic probability and statistics. Knowledge of digital signal processing (ELEN E4810) is helpful but not required. In addition, there will be several programming assignments in C++. Only basic features of C++ will be used, so while we do not require proficiency in C++, proficiency in at least one programming language is required. A basic knowledge of Unix or Linux is also helpful. If you do not have the prerequisites and would still like to take the course, please contact one of the instructors.

Readings will be taken from a variety of sources; PDF versions of the appropriate readings will be made available before each lecture. There is no required text; below are recommended and reference texts that students might find useful. (Prices are from Amazon as of Jan. 2016.)

Recommended text:

• Speech Synthesis and Recognition, John Holmes and Wendy Holmes (2nd ed., paperback, 298 pp., 2001, ISBN 0748408576, $70)
  • Good introductory text covering many areas.

Reference texts:

• Theory and Applications of Digital Signal Processing, Rabiner, Schafer (hardcover, 1056 pp., 2010, ISBN 0136034284, $161)
  • Reference for signal processing.

• Speech and Language Processing, Jurafsky, Martin (2nd ed., hardcover, 1024 pp., 2008, ISBN 0131873210, $141; also int'l edition)
  • Reference for language modeling and text processing.

• Statistical Methods for Speech Recognition, Jelinek (hardcover, 305 pp., 1998, ISBN 0262100665, $53)
  • Hardcore coverage of selected topics.

• Spoken Language Processing, Huang, Acero, Hon (paperback, 1008 pp., 2001, ISBN 0130226165, $68)
  • Exhaustive reference for ASR.

The coursework will consist of five programming assignments in C++ and a final reading project. The programming assignments will involve implementing various portions of a basic speech recognition system. Initially, a simple dynamic time warping recognizer will be written, and this will be incrementally extended during the semester to form a large vocabulary continuous speech recognizer. Students will be given accounts on the EE department's ILAB computer cluster to complete the programming assignments.

For the final reading project, students will be asked to read one or more papers about a topic not covered in depth in class, and to write a 1500-2500 word paper reviewing and analyzing the material. A list of suggested papers will be provided, or students may choose their own with instructor approval.

Instead of the final reading project, motivated students have the option of doing a programming/experimental project, either individually or in a group. A list of projects will be provided, or students may propose their own, subject to approval from the instructors. Each individual or team must write a paper describing their work and give a 10-15 minute presentation to the class.

The final grade will be broken down as follows:

For students who select the experimental project, the share of the final project will be raised to 40% (and the share of the labs reduced) if this results in a higher grade.

Stanley F. Chen <stanchen@us.ibm.com>
Last updated: 2016 Apr 05