Dan Ellis: Research Overview

Dan Ellis is the principal investigator of the Laboratory for Recognition and Organization of Speech and Audio (LabROSA) at Columbia University, which he founded in 2000. Research at the lab focuses on extracting information from sound in many domains and guises; this page gives an overview of the research areas and the linking themes.

Introduction: Information from Sound

Sound carries information; that is why we and other animals have evolved a sense of hearing. But the useful information in sound -- what events are happening nearby, and where they are -- can appear in a convoluted and variable form, and, worse still, is almost always mixed together with sound from other, simultaneous sources. Thus, the problem of extracting high-level, perceptually-relevant information from sound is complex and involved, and this problem is the focus of my research.

The goal of my group at Columbia -- the Laboratory for Recognition and Organization of Speech and Audio, or LabROSA -- is to develop and apply signal processing and machine learning techniques across the wide range of audio signals commonly encountered in daily life. This includes extracting many kinds of information from speech signals, music recordings, and environmental or 'ambient' sounds.

My Ph.D. research addressed separating real-world sound mixtures into discrete sound 'events', based on processing designed to duplicate the effects known in psychoacoustics as auditory scene analysis. During my postdoc at the International Computer Science Institute (ICSI) in Berkeley, I learned the data-driven, statistical approach that prevails in speech recognition. In setting up my own lab at Columbia, my goal was to apply and extend these techniques across a wider range of sounds, including the problem of separating sound mixtures. This matters because the majority of work in sound analysis has assumed that recognition needs only global features (such as cepstra), as if the target source were the only significant contributor to the sound.

While there are numerous groups working on speech recognition, and several groups specializing in music processing, or signal separation, or content-based retrieval, LabROSA is the only group that combines all these areas. We address identifying the information in the many different kinds of sounds of interest to listeners, with an emphasis on real-world environments and the particular issues that they pose, such as the availability of copious unlabeled data, and the problems of mixed sources and background interference.

All these areas and applications have a close relationship in terms of the essential characteristics of the signal such as pitch and noise interference, and in terms of the techniques, from autocorrelation to hidden Markov modeling, that are most useful. This is the logic behind having a lab that tries to encompass all aspects of extracting information from sound, and it has been borne out in practice time and again, as tools and techniques turn out to be useful across multiple domains -- such as the piano transcription pitch tracker that is useful for picking out speech in noise, or the classifier developed for music similarity that works well at identifying emphasized words in natural speech.
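As a rough illustration of why a tool like autocorrelation carries across domains, here is a minimal sketch of pitch estimation by picking the autocorrelation peak of a single frame. It is not the lab's pitch tracker: the function name `autocorr_pitch`, the 50-500 Hz search range, and the synthetic 200 Hz test tone are all assumptions made for this example.

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Hypothetical helper: searches lags corresponding to fmin..fmax
    and returns the frequency of the lag with the largest
    (unnormalized) autocorrelation.
    """
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag over the overlapping samples.
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return sample_rate / best_lag

# Synthetic test signal (an assumption for illustration): a 200 Hz tone.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
f0 = autocorr_pitch(frame, sample_rate)
print(f0)
```

The same peak-picking step applies whether the frame comes from a piano note or a voiced speech segment, which is the kind of reuse described above. A practical tracker would add normalization, voicing decisions, and smoothing over time.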

Recent and current projects in the group break down into five main areas:

Speech processing and recognition
Source separation and organization
Music audio information extraction
"Personal Audio" (environmental sound) organization
Marine mammal sound recognition

Each of these is discussed in more detail below.

Speech processing and recognition

We have focused on the problem of recognizing speech in adverse conditions, including development of the 'tandem' approach to acoustic modeling, where a neural network is used as a discriminant preprocessor to a conventional GMM-HMM system [HermES00, EllisR01], and more recently the development of novel features based on temporal envelopes extracted via linear prediction [AthinHE04a, AthinHE04b, AthinE03a, AthinE03b, MorganZ05]. Our current work, funded by the DARPA GALE project, is looking at speech prosody (e.g. pitch, energy, and timing) to extract information beyond the words, such as which words are being emphasized -- something that could be helpful for the machine translation aspects of GALE. As part of this, we plan to conduct subjective experiments to verify our theories about which aspects of pitch are perceptually salient, by seeing if listeners can distinguish between real and 'simplified' pitch tracks.
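The tandem idea of using a network as a preprocessor can be sketched in a few lines. The version below is an assumption about one common formulation, not a description of the systems in [HermES00, EllisR01]: per-frame classifier scores are converted to class posteriors with a softmax, and the log of those posteriors becomes the feature vector handed to the downstream model. The function name `log_posterior_features`, the flooring constant, and the example scores are hypothetical. A decorrelating transform is typically applied before GMM modeling and is omitted here.

```python
import math

def log_posterior_features(frame_scores, floor=1e-10):
    """Turn per-frame classifier scores into log-posterior features.

    Hypothetical helper: `frame_scores` is a list of frames, each a
    list of pre-softmax scores, one per class.
    """
    features = []
    for scores in frame_scores:
        peak = max(scores)  # subtract the peak for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Floor the posteriors so the log never sees zero.
        features.append([math.log(max(e / total, floor)) for e in exps])
    return features

# Made-up scores for two frames and three classes (illustration only).
scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
feats = log_posterior_features(scores)
print(feats)
```

Taking the log spreads the posteriors, which cluster near 0 and 1, into a range that is closer to what Gaussian mixtures model well. That is the usual motivation for this step.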

We have been working for several years with recordings of natural meetings, including identifying speaker turns based on multichannel recordings [PfauES01, RenalsE03, EllisL04] and recognizing specific events such as laughter or emphasized utterances [KennE04, KennE03].

See also the speech projects page.

Source separation and organization

Looking specifically at the problem of separating speech mixtures, we have developed several models for identifying overlaps and inferring masked properties, including the speech fragment decoder [CookeE01, BarkCE05], deformable spectrograms [ReyesJE04, ReyesEJ04] and dictionary-based systems [EllisW06]. We have also looked at the difficult problem of how to evaluate this kind of system [Ellis04].

We have just begun a collaboration with labs at Boston University, Ohio State, and the East Bay Institute for Research and Education, funded by the NSF and on which I am Principal Investigator, to develop and combine models for separating speech with the specific goal of improving intelligibility for human listeners. This is a very complex objective that has been mostly ignored by the more signal-processing oriented work in source separation, but something we can address by collaborating with two labs expert in psychoacoustic experiments. We have had strong interest in this work from several other bodies, in particular some hearing aid manufacturers, and we hope eventually to be able to show benefit from our techniques for a hearing-impaired population.

See also the speech separation page.

Music audio information extraction

For several years we have been looking at the problem of music similarity, particularly with a view to making recommendations and organizing personal collections [BerenEL02, BerenEL03, WhitE04], again putting an emphasis on the problems of evaluation [EllisWBL02, BerenLEW03]. In support of similarity measurement, we have worked on extracting a variety of listener-relevant features from music signals including chords, rhythm, melody, etc. [ShehE03, EllisA04, TuretE03].

We have been strong advocates of common evaluation standards for this kind of work; we were closely involved in the first international evaluations held in 2004 at the International Symposium on Music Information Retrieval in Barcelona. For the 2005 conference in London there was an even larger set of evaluations; we again came top in artist identification (among 7 international teams) [MandelE05, MandPE06].

Another evaluation concerned reducing full music recordings to their essential melody. We came a close third in a field of ten participants, which is notable since our approach, based on machine learning rather than expert-designed models, was radically simpler than the others and made many fewer assumptions about the material [PolinerE05, EllisP06]. Our music work has been supported by industry and by a grant from the Columbia Academic Quality Fund.

See also the music projects page.

"Personal Audio" (environmental sound) organization

A unique aspect of the group is our emphasis on natural sound environments and events -- sounds other than speech and music that actually comprise the overwhelming majority of sounds experienced by hearing beings. We have looked at applying the kind of machine learning and pattern recognition techniques developed for speech recognition to problems such as detecting alarm sounds and rhythmic claps [Ellis01, LessE05], and to analyzing the perceptual attributes of machine sounds [DobsWE05].

Recently, we have been investigating the analysis of long-duration 'personal audio' recordings to see if infrequently-changing categories such as location and activity can be effectively recovered from such data, and to see what kind of novel diary and memory-prosthesis applications can be made possible by this neglected opportunity for lightweight data collection [EllisL04a, EllisL04b]. This work is supported by an NSF CAREER grant ("The Listening Machine") as well as under the Microsoft Research "Memex" program.

See also the personal audio page.

Marine mammal sound recognition

Finally, we have also begun to look at underwater sound in relation to marine mammals. Although this is stretching the mission of the lab a little, there are many human listeners currently engaged in analyzing recordings of whale and dolphin sounds, a task we would like to make easier with computers. So far, we have worked on automatic clustering of whale clicks [HalkE06a] and segregating dolphin whistles from mixtures [HalkE06b]. We have recently begun collaborating with Dr. Diana Reiss, a prominent dolphin biologist at the New York Aquarium, and are developing a project with her to recognize and classify dolphin whistles in real time -- the equivalent of speech recognition for dolphins!

Last updated: 2006/06/05 15:09:02
Dan Ellis <[email protected]>