Towards End-to-End Speech Recognition

September 16, 2015
Speaker: Tara Sainath, Google Research


In this talk, I will discuss several efforts in our group at Google to replace parts of the acoustic modeling pipeline with neural networks. First, I will describe a new modeling approach known as the Convolutional, Long Short-Term Memory, Deep Neural Network (CLDNN) and explain why this architecture makes sense for speech tasks. Next, I will talk about using CLDNNs for raw-waveform modeling, which allows us to remove the front-end log-mel filterbank feature computation. Finally, I will discuss Connectionist Temporal Classification (CTC), which removes the need for a prior alignment and for context-dependent (CD) states.

Hosted by Colin Raffel.

