Machine Listener

Most researchers would agree that the most important factor in the success of current speech recognition systems is their ability to learn statistical models from very large corpora of speech data. By basing speech models on literally hundreds of hours of recordings, those models can be made to accommodate a large proportion of the variability in voice, pronunciation and conditions that is encountered in real life. This kind of learning relies on three things: having a good underlying model with parameters to be set, having an algorithm to update those parameters based on training data, and having a large collection of training data, including any required tags or labels. In the case of speech recognition systems, that generally means having the hundreds of hours of recordings manually transcribed to obtain the corresponding words.

Thinking beyond the realm of transcription systems, we are interested in developing systems that can listen to 'general audio' (i.e. the mix of real-world sounds that surround us every day) and make some kind of analysis in terms of objects and sources that resemble those perceived by listeners. In comparison to work in speech recognition, work in computational auditory scene analysis (CASA), the effort to build automatic sound understanding systems, is rather small-scale, and based on the heuristic insights of researchers rather than on the optimization of general-purpose models. If only we could think of a way to exploit large amounts of training data, that would surely lead to more successful systems.

Assuming we are interested in a relatively unconstrained acoustic domain, access to large datasets might be rather easy. For instance, we could just have a computer connected to a microphone, collecting the sound of its immediate environment. For more variety and activity, we could connect a conventional receiver of radio or TV broadcasts to a computer, to obtain a never-ceasing stream of 'real' audio. The problem with this data is that it has no associated hints of what it contains (no labels), so many pattern recognition algorithms cannot be used.

There is, however, a set of algorithms that attempt to learn statistical patterns in data without any example labels: so-called 'unsupervised' learning, such as K-means clustering, Gaussian mixture modeling, or Kohonen neural networks. These algorithms can be used to find distinct clusters within data, if any exist, and even for more continuously-distributed data, they will tend to break it down into several regions with different properties.
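As an illustration of what such an algorithm does, here is a minimal sketch of K-means clustering in plain Python. It is not code from this project: the function names `sq_dist` and `kmeans`, the farthest-point initialization, and the toy two-dimensional points are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain K-means on a list of equal-length tuples.

    Returns (centroids, labels). Initialization is farthest-point
    (an assumption for this sketch): start from the first point, then
    repeatedly add the point farthest from the centroids chosen so far.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Toy, unlabeled "feature vectors" forming two obvious groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

No labels are supplied at any point; the grouping emerges from the distribution of the data alone.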

The idea behind this project is to use this kind of unsupervised learning algorithm to generate models of some broadcast acoustic signal or signals. Because the audio stream is always available, we don't have to worry about storing it; when we want some more data, we simply start recording again. The goal in this project is to leverage the availability of effectively unlimited example data, and the untiring efforts of a computer that is monitoring it 24 hours a day, by devising algorithms that can continually refine and improve models by looking at very large amounts of data.
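One simple way to refine a model continually without storing the stream is a sequential (online) variant of K-means, in which each incoming feature vector nudges its nearest centroid and is then discarded. The sketch below is only an illustration of that idea under stated assumptions; the class name `OnlineKMeans`, its `update` method, and the one-dimensional toy stream are hypothetical and not part of the project.

```python
class OnlineKMeans:
    """Sequential K-means: centroids are updated one observation at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def update(self, x):
        """Assign x to its nearest centroid, move that centroid, return its index."""
        j = min(
            range(len(self.centroids)),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(x, self.centroids[i])),
        )
        self.counts[j] += 1
        # A step size of 1/count makes the centroid the running mean of the
        # observations assigned to it (the initial value is overwritten by
        # the first observation).
        rate = 1.0 / self.counts[j]
        self.centroids[j] = [c + rate * (a - c) for a, c in zip(x, self.centroids[j])]
        return j


model = OnlineKMeans([(0.0,), (10.0,)])
for x in [(1.0,), (3.0,), (9.0,), (11.0,)]:  # stand-in for an endless stream
    model.update(x)
print(model.centroids)  # → [[2.0], [10.0]]
```

Replacing the `1/count` step size with a small constant would let the centroids keep tracking a stream whose statistics drift over time, at the cost of never fully settling.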

Based on the block diagram above, this project could proceed as follows:

Acoustic event detector: As a way of imposing a little bit of structure on the undifferentiated audio stream, we would start off with a hand-built algorithm to find 'events' in the stream. The idea is to convert the continuous stream into occasional discrete objects, described by a set of parameters, which can be used as separate training examples for the subsequent learning operations. This first stage could also perform some preliminary classification, e.g. into sound/music/other, so that we could direct our learning efforts into particular classes.
The question of which parameters are used to describe each of these candidate elements is of course critical. Different representations will lead to totally different kinds of learned classes. Much of the project will likely involve iterating between changing the representation and experimenting with the clustering while we learn what can be acquired most successfully.
In the first instance, the classic auditory scene analysis cues of onset (cross-frequency energy increase) and harmonicity (common periodic modulation of the energy envelope within different subbands) will be used to detect the beginnings and ends of promising-looking events in the audio.
Unsupervised clustering: Given a working pre-processing stage generating a series of parameterized candidate events in real-time (or as close to real-time as can be managed), the next stage will be to work on machine learning and clustering algorithms to 'mine' for patterns and structure within this representation. At the simplest level, this could mean modeling the distribution of spectral feature vectors to notice particularly common classes of sound, although in general we will want to incorporate the time dimension through sequence models such as HMMs. There's a great deal of scope for experimenting with different kinds of basic model to see what kind of classes they can begin to learn (and how they interact with the representation).
Event template feedback: The most distinctive idea of this project is the limitless unlabeled data upon which it is based. The most interesting results will come from algorithms that can continue to learn new structure over long time scales. One way to do this is to provide some kind of feedback, so that as the clustering stage identifies potentially significant patterns within the input data, this can be fed back, automatically, to the event and feature extraction stages, thereby modifying the data being presented to the clustering stage. The risk here is of divergent modeling that shoots off in some random, uninformative direction. The challenge is to develop feedback schemes that actually refine and 'bootstrap' the information extraction process. As such, this stage should also include the capacity for interactive guidance, where human experts can evaluate the different patterns being pursued by the system, and reward those that seem to correspond to meaningful classes.
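
The onset cue used by the acoustic event detector (an energy increase that occurs across many frequency bands at once) can be sketched as follows. This is a simplified illustration, not the project's detector: it assumes the audio has already been converted into per-frame subband energies, it ignores the harmonicity cue entirely, and the function name `detect_onsets` and the parameters `rise_db`, `min_fraction` and `floor` are hypothetical choices.

```python
import math


def detect_onsets(band_energies, rise_db=6.0, min_fraction=0.5, floor=1e-10):
    """Return indices of frames where energy rises in many bands at once.

    band_energies: list of frames, each a list of non-negative subband
    energies (assumed to come from some earlier filterbank stage).
    A frame is marked as an onset when at least `min_fraction` of its
    bands have risen by `rise_db` decibels or more since the previous frame.
    """
    onsets = []
    for t in range(1, len(band_energies)):
        prev, cur = band_energies[t - 1], band_energies[t]
        rising = sum(
            1
            for p, c in zip(prev, cur)
            # `floor` avoids taking the log of zero in silent bands.
            if 10.0 * math.log10(max(c, floor) / max(p, floor)) >= rise_db
        )
        if rising >= min_fraction * len(cur):
            onsets.append(t)
    return onsets


quiet = [1.0, 1.0, 1.0, 1.0]
loud = [10.0, 10.0, 10.0, 10.0]
one_band = [10.0, 10.0, 10.0, 100.0]  # rise confined to a single band
frames = [quiet, quiet, quiet, loud, loud, loud, one_band, quiet]
print(detect_onsets(frames))  # → [3]
```

Only the broadband jump at frame 3 is reported; the rise confined to a single band at frame 6 falls below the cross-frequency threshold. In a fuller system, each detected onset would open a candidate event whose parameters are then passed on to the clustering stage.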

This approach of unsupervised acquisition of sound templates is inspired by the similar work on 'acoustic context awareness' by Brian Clarkson and others at the MIT Media Lab.

Last updated: $Date: 2000/12/11 17:18:01 $
Dan Ellis <[email protected]>