Background and Questions:

The brain empowers humans and other animals with remarkable abilities to sense and perceive their acoustic environment, even in highly degraded conditions. These tasks, seemingly trivial for humans, have proven extremely difficult to model and implement in machines. Our interdisciplinary research approach addresses the following fundamental questions:

  1. What computation is done in the brain when we listen to complex sounds in various conditions?

  2. How could this computation be modeled and implemented in machines?

  3. How could one build an interface to connect brain signals to machines?

Representation of phonetic features in human auditory cortex:

During speech perception, linguistic elements such as consonants and vowels are extracted from a complex acoustic speech signal. The superior temporal gyrus (STG) participates in higher-order auditory processing of speech, but how it encodes phonetic information is poorly understood. We recorded directly from the cortical surface in human subjects while they listened to natural speech and found response selectivity to distinct phonetic features. Phonetic features could be related directly to tuning for spectrotemporal acoustic cues, some of which were encoded nonlinearly or by integration of multiple cues. These findings demonstrate the acoustic-phonetic representation of speech in human STG.
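
For illustration, the sketch below shows one way such an encoding analysis can be set up: a spectrotemporal receptive field (STRF) for a single electrode is estimated by ridge regression from a time-lagged stimulus spectrogram to the recorded high-gamma response. The array names, lag range, and regularization constant are assumptions for the example, not the exact pipeline used in the study.

```python
import numpy as np

def estimate_strf(spectrogram, response, n_lags=40, alpha=1.0):
    """Ridge-regression STRF: map a time-lagged spectrogram (time x freq)
    to a single electrode's response (time,). Illustrative sketch only."""
    T, F = spectrogram.shape
    # Build the lagged design matrix: each row holds the preceding n_lags frames.
    X = np.zeros((T, n_lags * F))
    for lag in range(n_lags):
        X[lag:, lag * F:(lag + 1) * F] = spectrogram[:T - lag]
    # Closed-form ridge solution: w = (X'X + alpha*I)^-1 X'y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(n_lags * F), X.T @ response)
    return w.reshape(n_lags, F)   # lags x frequency filter

# Hypothetical usage: spec is (time, freq), hg is a high-gamma envelope (time,)
# strf = estimate_strf(spec, hg)
```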

The cocktail party problem (Cherry, 1953):

To extend our understanding of how the human brain behaves in degraded conditions, we engaged human subjects in a multispeaker speech perception task while invasively recording the neural activity from their brains. Strikingly, we found that the cortical representation of attended speech remains unchanged even when a second, interfering speaker is added, almost as if the second voice were not present at all. In addition, we showed that one can decode which speaker and words a person is attending to simply by analyzing the pattern of brain activity.
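
A minimal sketch of the decoding idea, under simplifying assumptions: a spectrogram reconstructed from the neural activity is correlated with the spectrogram of each candidate speaker, and the best-matching speaker is taken as the attended one. The reconstruction step itself is abstracted away, and all variable names are hypothetical.

```python
import numpy as np

def decode_attended_speaker(reconstruction, candidate_spectrograms):
    """Pick the speaker whose spectrogram best matches the neural reconstruction.
    reconstruction: (time, freq) array decoded from cortical activity.
    candidate_spectrograms: list of (time, freq) arrays, one per speaker."""
    scores = []
    for spec in candidate_spectrograms:
        r = np.corrcoef(reconstruction.ravel(), spec.ravel())[0, 1]
        scores.append(r)
    return int(np.argmax(scores)), scores

# Hypothetical usage with two talkers:
# idx, scores = decode_attended_speaker(recon, [spk1_spec, spk2_spec])
# print(f"Listener is attending to speaker {idx}")
```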

Speech processing in machines:

A computational model that captures the hypothesized transformations of sound along the auditory pathway has proven highly effective in a variety of speech processing applications. Owing to its superior performance, this algorithm has been adopted by many groups for finding speech in heavily corrupted signals (ICASSP 2012). We have also used this model successfully in a variety of other tasks, including speech enhancement, phoneme recognition, and speaker identification.

N. Mesgarani, M. Slaney and S. A. Shamma, (2006) “Content-based audio classification based on multiscale spectro-temporal features”, IEEE Trans. Speech and Audio, May 2006

T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Vesely, P. Matejka, (2012) “Developing a Speech Activity Detection System for the DARPA RATS Program”, Interspeech, Portland

N. Mesgarani and S. A. Shamma, (2007) “Denoising in the domain of spectro-temporal modulations”, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2007
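
As a rough illustration of the multiscale spectrotemporal analysis these papers build on, the sketch below filters an auditory spectrogram with a bank of 2-D Gabor-like filters tuned to different temporal rates and spectral scales. The filter shapes, frame rate, and channel density are simplifying assumptions rather than the exact model.

```python
import numpy as np
from scipy.signal import fftconvolve

def modulation_features(spectrogram, rates=(2, 4, 8, 16), scales=(0.5, 1, 2)):
    """Filter a (time x freq) spectrogram with 2-D Gabor-like kernels tuned to
    temporal rates (Hz, assuming 100 frames/s) and spectral scales (cyc/oct,
    assuming 12 channels/octave). Returns magnitude responses per (rate, scale)."""
    frame_rate, chan_per_oct = 100.0, 12.0
    t = np.arange(-0.25, 0.25, 1.0 / frame_rate)      # 0.5 s temporal support
    f = np.arange(-1.0, 1.0, 1.0 / chan_per_oct)      # 2-octave spectral support
    feats = {}
    for rate in rates:
        for scale in scales:
            # Separable Gabor: temporal carrier at `rate`, spectral carrier at `scale`
            ht = np.cos(2 * np.pi * rate * t) * np.exp(-(t * rate) ** 2)
            hf = np.cos(2 * np.pi * scale * f) * np.exp(-(f * scale) ** 2)
            kernel = np.outer(ht, hf)
            feats[(rate, scale)] = np.abs(
                fftconvolve(spectrogram, kernel, mode="same"))
    return feats
```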

Auditory processing strategies also led us to develop information processing systems that can deal with unexpected inputs, which commonly cause unpredictable behavior in current systems. The technique uses a statistical model of the output expected in normal conditions to adaptively suppress unreliable channels while simultaneously enhancing informative ones. This approach considerably reduces the need for extensive training of speech processing algorithms.

N. Mesgarani, S. Thomas, H. Hermansky, (2011) “Adaptive Stream Fusion in Multistream Recognition of Speech”, Interspeech, Florence, Italy

N. Mesgarani, S. Thomas, H. Hermansky, (2011), “Toward optimizing stream fusion in multistream recognition of speech”, JASA Express Letters
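
A hedged sketch of the fusion idea: each stream produces phoneme posteriors, a per-stream reliability score is derived from how well those posteriors match what is expected in clean conditions (here a simple inverse-entropy proxy stands in for the trained performance monitor used in the papers), and the streams are combined with weights proportional to that score.

```python
import numpy as np

def fuse_streams(posterior_streams, eps=1e-9):
    """Combine per-stream phoneme posteriors (each time x classes) with weights
    favoring streams whose posteriors are sharp (low entropy), a crude proxy
    for matching the statistics expected under clean conditions."""
    weights = []
    for P in posterior_streams:
        entropy = -np.sum(P * np.log(P + eps), axis=1).mean()
        weights.append(1.0 / (entropy + eps))        # more confident -> higher weight
    weights = np.array(weights) / np.sum(weights)
    fused = sum(w * P for w, P in zip(weights, posterior_streams))
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize per frame

# Hypothetical usage with two streams of (time, n_phonemes) posteriors:
# fused = fuse_streams([stream_a, stream_b])
```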

Reconstructing speech from neural responses:

We developed a novel inverse mapping technique with which neural responses in ferret auditory cortex could be used to “reconstruct” the actual sound that the animal heard.

N. Mesgarani and E. F. Chang, (2012) “Selective cortical representation of attended speaker in multi-talker speech perception”, Nature

The reconstruction algorithm is explained in these papers:

N. Mesgarani, S. V. David, J. B. Fritz, S. A. Shamma, (2009) “Influence of Context and Behavior on the Stimulus Reconstruction from Neural Activity in Primary Auditory Cortex”, Journal of Neurophysiology

B. Pasley, S. V. David, N. Mesgarani, N. Crone, S. Shamma, R. Knight, E. F. Chang, (2012), “Reconstructing speech from human auditory cortex”, PLoS Biology 10(1)
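
A minimal sketch of the linear reconstruction approach described in these papers, under illustrative assumptions about lag range, regularization, and array layout: a decoder is fit by ridge regression from time-lagged multi-electrode responses back to the stimulus spectrogram, then applied to held-out responses to estimate the sound that was heard.

```python
import numpy as np

def lagged(responses, n_lags):
    """Stack time-lagged copies of (time x electrodes) responses into a design matrix."""
    T, E = responses.shape
    X = np.zeros((T, n_lags * E))
    for lag in range(n_lags):
        X[:T - lag, lag * E:(lag + 1) * E] = responses[lag:]  # responses follow the stimulus
    return X

def fit_reconstruction(responses, spectrogram, n_lags=30, alpha=1.0):
    """Fit a ridge decoder G mapping lagged neural responses to the stimulus spectrogram."""
    X = lagged(responses, n_lags)
    G = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ spectrogram)
    return G

def reconstruct(responses, G, n_lags=30):
    """Apply the fitted decoder to new responses to estimate the heard spectrogram."""
    return lagged(responses, n_lags) @ G

# Hypothetical usage: fit on one set of recordings, reconstruct a held-out one.
# G = fit_reconstruction(train_resp, train_spec)
# est_spec = reconstruct(test_resp, G)
```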

Read about a real-time EEG implementation of this algorithm here.

What the ferret actually heard: Research_files/si590.wav

This method was later extended to recover the speech signal heard by a human listener from neural activity recorded in the listener’s cortex (Brian Pasley). It remains to be seen how much information can be decoded from the brain when one is listening to or imagining speech.



In the news:

Berkeley press release

Coverage by the BBC (news article and video)

What we reconstructed from the ferret’s brain: Research_files/Speech5_60.wav

N. Mesgarani et al., (2014) “Phonetic feature encoding in human superior temporal gyrus”, Science

In the news:

Coverage in National Public Radio (NPR)

Web story at Columbia