Department of Electrical Engineering - Columbia University

ELEN E6820 - Spring 2009

SPEECH AND AUDIO PROCESSING AND RECOGNITION

E6820 Project Suggestions for Spring 2009

I'm adding an update to my list of suggested projects, but I'm keeping the previous list unmodified. Some of the topics are repeated, but there has been some evolution in the meantime...

Chord recognition

We participated in the chord recognition evaluation at last year's MIREX. Our results put us 2nd of 7; I've put our entire system online. I'm not sure what we did differently from the best system, but I suspect it comes down to making the chord transition matrix depend on the key of the piece (or, equivalently, using key normalization). This would be quite easy to try. There are also some ideas on how to improve the features, such as trying to eliminate the melody line from the representation of the current notes.

Rhythm extraction

We have a simple and effective beat tracking system that did quite well in the 2006 MIREX Beat Tracking evaluation (the last time it was held). There are, however, several aspects of this work that could be improved. Most obviously, it cannot handle changes in tempo mid-piece. I think the framework could easily handle this, but it needs a separate decision process that considers re-estimating the tempo. Secondly, it only estimates one level of tempo, whereas a real rhythmic description requires multiple levels, including downbeats for each bar. This is significantly more challenging, but could be very important for musical applications.

Music audio clustering

One of the interesting possibilities presented by very large collections of audio is clustering similar pieces, or similar segments within pieces. In theory this is straightforward: calculate the distance (by some measure) between all pieces/segments, and look for groups with particularly low mutual distances. In practice, for any musical dataset that is large enough to possibly contain interesting clusters, this becomes computationally infeasible without some additional tricks. Two possibilities are (a) fingerprinting, which extracts compact tell-tale features that can be quickly looked up in a hash table to find repeats, and (b) locality-sensitive hashing, a very neat technique for finding near neighbors in continuous feature space in approximately constant time. Applications include "normalizing" a music collection to eliminate repeats, finding "quotes" (samples) of one musical piece in another, and, with the right musical representation, finding similarities between different pieces (common motifs or riffs, etc.). Similar techniques can be applied to online streams (e.g. web radio) to find repeating segments - this was the topic of a project a couple of years ago that could be developed further. Another idea is to do this for soundtracks from YouTube videos, which contain a lot of commercial music in various states of degradation.
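
To make the locality-sensitive hashing idea concrete, here is a minimal sketch using random-hyperplane (sign) hashing, which is only one of several LSH families and is an assumption on my part rather than a recommendation from any particular paper. The function names and the toy 4-dimensional "segment features" are made up for illustration; a real project would hash features extracted from audio and would normally use several hash tables to reduce misses.

```python
import random


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vec, hyperplanes):
    """Hash a feature vector to a tuple of sign bits, one per hyperplane."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def build_table(items, hyperplanes):
    """Map each hash key to the names of the items that fall in that bucket."""
    table = {}
    for name, vec in items.items():
        table.setdefault(lsh_key(vec, hyperplanes), []).append(name)
    return table


def candidates(query, table, hyperplanes):
    """Return names sharing the query's bucket (possible near neighbors)."""
    return table.get(lsh_key(query, hyperplanes), [])


# Toy usage with hypothetical segment features (not real audio data).
planes = make_hyperplanes(n_bits=8, dim=4)
segments = {
    "song_a_chorus": [0.9, 0.1, 0.3, 0.5],
    "song_b_intro": [-0.7, 0.8, -0.2, 0.1],
}
table = build_table(segments, planes)
# A louder copy of song_a_chorus points in the same direction, so it
# lands in the same bucket.
print(candidates([1.8, 0.2, 0.6, 1.0], table, planes))
```

Lookup cost depends on the number of hash bits and the bucket size rather than on the size of the collection, which is where the "approximately constant time" behavior comes from. Candidates from a bucket would still need to be checked with the real distance measure.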

Music Audio Segmentation

Although it has attracted a fair amount of attention, I feel the problem of segmenting music audio into verse, chorus, bridge, etc. is still not well solved. It would be nice to study this. A problem is the lack of clear ground-truth results; I'm currently intrigued by the idea of extracting the segment markers from Guitar Hero data for this purpose, which appears feasible.

E6820 Project Suggestions for Spring 2007

Here is an addendum to the list of project suggestions that reflect things I am currently interested in, or particular papers I've seen recently that might make the basis of an interesting investigation. It's based on one I did last year, but with a number of additions and modifications.

Music-related projects

Artist Identification

There was an evaluation at the 2005 ISMIR conference on automatically identifying the artist of an audio recording; the system from my lab came top with over 70% correct from a closed set of over 100 artists. It worked by using a support vector machine, which involves comparing song-level statistics between the test song and all of the songs in the training set, and an optimization procedure figures out which reference songs are most important (the "support vectors"). This makes me think that the whole song is probably not all that important; rather, there may be little snippets in various places that are particularly typical of a particular artist. One approach would be to chop an artist's work into a lot of short segments and test them all for the ability to pick out other work by that artist versus other artists - then with enough such snippets, each of which maybe only works a few times, you might be able to cover all the tell-tale `twists' of a particular artist. But how to compare the snippets between songs to preserve perceived similarity, and how to identify the best snippets, is far from obvious.

Chord recognition

A little while back I did a project with Alex Sheh on recognizing the chords in music using the HTK speech recognition toolkit; you can read about what we did in our ISMIR-2003 paper. That work, however, concluded before we really did what I was hoping to do, which was to try out some different kinds of novel features for detecting chords. This project would involve extending that work to use a larger training set to get better results, then comparing the PCP features Alex used with some alternative representations. I'm particularly interested in looking at subharmonic peaks, the greatest common divisors of the common notes in a chord.
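
For anyone unfamiliar with PCP (pitch class profile) features, the sketch below shows the basic idea: spectral energy is folded into 12 bins, one per semitone of the octave. This is a simplified illustration under my own assumptions (equal temperament, A4 = 440 Hz, input given as already-picked spectral peaks), not the exact feature computation from the ISMIR-2003 paper. The names `pitch_class` and `pcp` are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class 0..11 (0 = C) in equal temperament."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi % 12


def pcp(peaks, a4_hz=440.0):
    """Fold (frequency_hz, magnitude) spectral peaks into a 12-bin profile."""
    profile = [0.0] * 12
    for freq_hz, mag in peaks:
        profile[pitch_class(freq_hz, a4_hz)] += mag
    total = sum(profile)
    # Normalize so the profile sums to 1 (left as all zeros if there is no energy).
    return [p / total for p in profile] if total > 0 else profile


# Toy usage: made-up peaks near C4, E4, G4 and C5 (a C major triad).
triad_peaks = [(261.63, 1.0), (329.63, 0.8), (392.00, 0.9), (523.25, 0.5)]
profile = pcp(triad_peaks)
print([NOTE_NAMES[k] for k, p in enumerate(profile) if p > 0])
```

Because octave information is discarded, the two C peaks end up in the same bin, which is exactly what makes the representation convenient for chords.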

Finding hidden 'modes' in music

As a generalization of the chord extraction work, we can note that Hidden Markov Model analysis provides a way to estimate a sequence of hidden states based on noisy or indirectly-related observations, and there are lots of possible 'hidden state' sequences that might be pulled out of a music signal, such as the particular key (mode) that a melody is being played in, the instrument, the verse/chorus etc. It would be interesting to try modeling music signals with a few HMM states, and to see how different choices of features, initialization, and structure result in different kinds of segmentation.
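
The step of estimating a hidden state sequence from observations is usually done with Viterbi decoding. Below is a minimal log-domain sketch. The two states, the "quiet"/"loud" observation symbols, and all of the probabilities are invented purely for illustration; in a real experiment the observations would be feature vectors and the parameters would be trained (e.g. with EM), which this sketch does not cover.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for a non-empty observation sequence."""
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers = []
    for o in obs[1:]:
        new_best, ptr = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this step.
            prev = max(states, key=lambda r: best[r] + log_trans[r][s])
            ptr[s] = prev
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][o]
        best = new_best
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Toy model with made-up numbers: states tend to persist ("sticky" transitions).
states = ["verse", "chorus"]
log_start = {s: math.log(0.5) for s in states}
log_trans = logs({"verse": {"verse": 0.8, "chorus": 0.2},
                  "chorus": {"verse": 0.2, "chorus": 0.8}})
log_emit = logs({"verse": {"quiet": 0.9, "loud": 0.1},
                 "chorus": {"quiet": 0.1, "loud": 0.9}})
obs = ["quiet", "quiet", "loud", "loud", "loud"]
print(viterbi(obs, states, log_start, log_trans, log_emit))
```

Changing the number of states, the transition structure, or the features fed in as observations is exactly the kind of variation this project would explore.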

Transcription alignment

Another ISMIR-2003 paper that I did with Rob Turetsky involved aligning music recordings with MIDI replicas to get a highly-accurate transcript of the note events in the actual music. There may be better ways of doing this: we converted the MIDI to audio, then aligned on audio features, but Chris Raphael had a different approach based on a very simple model of the expected spectral peaks given the notes. Good transcripts of real music are still something we really need, so it would be nice to carry this through.
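
One common way to align two feature sequences of different lengths is dynamic time warping (DTW). The sketch below is a generic textbook DTW under my own assumptions (scalar features, absolute-difference cost, unconstrained steps); I am not claiming it matches the details of the alignment in the paper. In practice the sequences would be frames of audio features from the recording and from the synthesized MIDI.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) aligning non-empty sequences a and b.

    path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    return cost[n][m], path[::-1]


# Toy usage: the second sequence holds its middle value one frame longer.
total, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
print(total, path)
```

The path tells you which frame of one sequence corresponds to which frame of the other, so note times known in the MIDI can be mapped onto the recording.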

Rhythm extraction and Cover Song Identification

There has been a lot of work published at ISMIR and elsewhere on extracting the beats and rhythms from music recordings -- see for example the Rhythm Feature Extraction session at ISMIR 2004. There are lots of interesting possibilities building on or modifying this work. For the 2006 MIREX Cover Song Identification contest, I put together a beat tracker as a basis for the beat-synchronous chroma features described in our ICASSP paper. My Matlab code for beat tracking and chroma feature extraction is posted online.

Musical phrase segmentation

Repetition is a key element of much music, such as the verse/chorus structure of western popular music. There has been some interesting work done on identifying these repetitions e.g. by Bartsch & Wakefield and more recently Chai and Vercoe, as well as several others; a more complex system used these kinds of techniques as a basis for lyric alignment to music: see the LyricAlly system, and Goodwin & Laroche looked more generally at finding segments. This project would investigate this problem, as well as trying to use a machine-learning approach to identify specific cues to the break between different segments that can recur in this kind of music.

Musical key signature detection

If you play a segment of western, tonal music to a listener with some musical training, they will usually be able to hum back the `tonic' note, the base of the chord that defines the `key signature'. It would be useful and interesting to be able to do this automatically from the audio signal; by using a very fine spectral analysis, it might also be possible to recover the precise tuning of the music, to allow later processing to account for slight, systematic mistunings of the instruments. There was a paper on this at last year's ISMIR by Steffen Pauws, but it would be interesting to try to find self-defined profiles, rather than the predefined ones he used.
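
To illustrate what matching against a predefined profile means, here is a minimal sketch that correlates a 12-bin chroma vector with all 12 rotations of a key template and picks the best. As a stand-in I use a plain binary major-scale template, which is my own simplification and not the profiles used by Pauws; it also only considers major keys. The function names are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Assumed template: 1 for the pitch classes of the C major scale, 0 elsewhere.
MAJOR_SCALE = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom > 0 else 0.0


def estimate_tonic(chroma, profile=MAJOR_SCALE):
    """Return (tonic_name, scores) for the best-matching rotation of profile."""
    scores = []
    for shift in range(12):
        # Rotate the template so that its tonic sits on pitch class `shift`.
        rotated = [profile[(k - shift) % 12] for k in range(12)]
        scores.append(pearson(chroma, rotated))
    best = max(range(12), key=lambda k: scores[k])
    return NOTE_NAMES[best], scores


# Toy usage: idealized chroma with equal energy on the notes of G major.
g_major_chroma = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1]
print(estimate_tonic(g_major_chroma)[0])
```

Learning the profile from data instead of fixing it in advance would amount to replacing `MAJOR_SCALE` with something estimated from a labeled or clustered collection.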

Undoing the (musical) mix

Carlos Avendano has an impressive demo for picking out single instruments from a stereo music recording based on selecting time-frequency cells that match a particular mix criterion; he described it at WASPAA'03. Combine that with a scheme for automatically identifying the particular source locations, like the power-weighted histograms of Yilmaz & Rickard, and you could have an interesting way to take apart a music mix. In fact, just a visualization of the stereo field based on such a technique could be interesting, and might apply to natural sounds and not just music.

Environmental and mixed sounds

Identifying repeating events in environmental recordings

My lab has a slightly offbeat project going on in analyzing personal audio archives - the kind of recording you get by carrying around a little dictation-style recorder all day. The data is very easy to collect and yet presents many, many issues, such as how to browse it, and what kind of information can usefully be extracted, at what scale. One idea is to try to identify sound events that occur multiple times in the recordings, more or less the same. I particularly like the sparse-landmark approach of Shazam's music fingerprinting technology and I'd like to see if that can be applied. Security is also a very big deal with this kind of data; I'm intrigued by the idea of distributing the data so that no single agent has access to the raw data, yet certain features can still be calculated.

Model-based signal separation

The general problem of separating sound mixtures has long been a strong interest of mine. Recently, Sam Roweis has published some very interesting work based on using strong models of the individual source: see for instance his Eurospeech-03 paper on refiltering of speech mixtures. This project would involve a reimplementation of his method, and an investigation into some details not discussed, such as the effect of mismatch between signal and model, and possible extensions to the approach.

Alarm sound detection

I wrote a paper about detecting sounds like telephone rings and smoke alarms in high-noise backgrounds. It's a topic that hasn't really received any attention, but I think it's quite promising. However, there needs to be more development of my initial experiments, and improvements of the systems I built, which had very high error rates. My paper is here. Specifically, I'd like to try training the systems on a wider range of alarms and noises, and at different noise levels. I'd also like to try the techniques of missing-data recognition (as developed by Cooke et al., described in this paper) on these signals.

Modulation spectrum of natural sounds

A lot of interest has been generated by recent demonstrations that speech and certain other sounds can be recognized even when most of their detail is removed by resynthesizing only their broad energy envelopes in a few frequency bands (see e.g. Rob Drullman's papers in JASA 95(2) and JASA 97(1)). The only remaining information is in the modulation of those energy bands, which can be described by a modulation spectrum. This project would develop tools to extract the modulation spectrum from short sound examples, and work with some perceptual results (obtained by Valeriy Shafiro) to correlate objective measures with subjective detectability.
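
As a rough illustration of what a modulation spectrum is, the sketch below takes the energy envelope of a single band (assumed to have been extracted and downsampled already) and computes the magnitude of its Fourier transform. The function name, the 100 Hz envelope rate, and the synthetic 4 Hz test envelope are all assumptions for the example; a real tool would first split the sound into bands and extract each envelope, and would use an FFT rather than this direct DFT.

```python
import cmath
import math


def modulation_spectrum(envelope, env_rate_hz):
    """Return a list of (modulation_freq_hz, magnitude) for one band envelope."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [e - mean for e in envelope]  # remove the DC component
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                  for t, c in enumerate(centered))
        spectrum.append((k * env_rate_hz / n, abs(acc) / n))
    return spectrum


# Toy usage: one second of an envelope sampled at 100 Hz, modulated at 4 Hz.
env = [1.0 + 0.5 * math.cos(2 * math.pi * 4 * t / 100) for t in range(100)]
spec = modulation_spectrum(env, env_rate_hz=100)
peak_freq, peak_mag = max(spec, key=lambda fm: fm[1])
print(peak_freq)
```

Speech envelopes are often described as having most of their modulation energy at a few Hz, which is why a slow test modulation is used here.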

Reverberation characterization

There's an interesting approach to characterizing the reverberation in a room by looking for common slopes in the energy decay curves described by Ratnam et al.; it would be interesting to investigate this idea and see if it might also be used for suppressing reverberation. For a different approach to removing reverberation, I was struck by the approach of Nakatani, who looks for an inverse filter to make a reverberated signal look more like clean speech (see his ICASSP-2003 paper). I also have an idea for separating the resonances due to reverb from those that describe the signal by fitting LPC models to short segments of signal (e.g. from the meeting recorder) and looking for the poles that frequently recur, since they likely reflect the room, not the source. Inverse filtering these resonances can help undo reverb.

Speech/voice projects

Speaking style normalization

One of the biggest problems facing current speech recognizers is the variation in speech arising from different speaking styles. Although human listeners have no trouble understanding either the crisp speech of a newsreader or the rapid discourse of a friend, from the computer's point of view, these signals seem almost unrelated. This project would study the relationship between these signals, perhaps by collecting new recordings of people speaking in different modes -- read versus spontaneous -- and look at ways to normalize away these differences. I have some ideas on how to do this, based on the spectral warping described in one of my Matlab examples of audio processing.

Speaking rate study

Another issue currently bewildering speech recognizers is the variation of speaking rate, i.e. how long each sound lasts when speaking. The project here would be to look at a large corpus of annotated speech (e.g. TIMIT), and try to build models that can accurately predict the duration of speech sounds based on some amount of context, or a few parameters.

Stutter detection

I was recently discussing the phenomenon of stuttering with a colleague: it's very common for children to go through a phase of stuttering during language development, but if the stuttering could be quantified, it might be possible to identify and remediate the cases in which stuttering may develop into an impediment. The first step in this would be to build a system to automatically identify stuttering events, which is interesting because it is more about noting the repetition of features than anything specific to the features themselves. To achieve this first step, we would need to obtain or create a database of labeled 'stutter' events, although I'm sure such things exist.

Narrator segmentation

There are all kinds of interesting things we'd like to do with the soundtrack of video material, such as trying to classify the background audio ambiance, picking out specific nonspeech events etc. But in a large number of cases, there is a very prominent voice-over track, a narrator talking about what is going on. Without special efforts to segment out this narration, the narrator's voice comes to dominate the characteristics of the audio, and the interesting `background' is lost. Instead, it would be useful to have something that specifically looks for the narrator/commentator's voice, and masks out those regions. I think we can do this quite well by using a classifier trained for speech recognition, simply to tell us when the signal is un-speech-like, although it might be interesting to add some speaker identification ideas on top (so that the narrator's voice can be distinguished from other voices).

Meeting Recorder Projects

The next couple of ideas relate to the Meeting Recorder project I am involved in. We have a new corpus of natural meetings recorded with 16 head-mounted and tabletop microphones, and we're looking at ways to get information out of a recording:

3D sound source localization from multiple microphone signals

Four of the recorded channels come from high-quality PZM mics placed on the conference table. By calculating the time difference between pairs of these mics, it is possible to triangulate the position of each voice (you will look at this in a practical later in the semester). However, we don't know exactly where the mics are, so it is necessary to iteratively estimate and re-estimate the positions of both mics and talkers. In some prior work, we showed that this approach was feasible, but we need to better understand its sensitivity to initialization and noise in the measurements. This project would involve some simulations of this kind of data to get best-case bounds on our approaches, then applying the same approaches to the real meeting data.
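
The time difference between a pair of mics is typically estimated from the peak of their cross-correlation. Here is a minimal sketch of that step only, with hypothetical function names and a made-up toy signal; it does a brute-force search over integer lags, whereas a practical version would use FFT-based correlation and probably some spectral weighting. The 16 kHz sample rate in the usage is just an assumed value.

```python
def best_lag(x, y, max_lag):
    """Lag in samples (positive if y is a delayed copy of x) maximizing cross-correlation."""
    def xcorr(lag):
        return sum(x[n] * y[n + lag]
                   for n in range(len(x)) if 0 <= n + lag < len(y))
    return max(range(-max_lag, max_lag + 1), key=xcorr)


def tdoa_seconds(x, y, sample_rate_hz, max_lag):
    """Time difference of arrival between two channels, in seconds."""
    return best_lag(x, y, max_lag) / sample_rate_hz


# Toy usage: mic_b hears the same pulse 3 samples later than mic_a.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
lag = best_lag(mic_a, mic_b, max_lag=5)
delay = tdoa_seconds(mic_a, mic_b, sample_rate_hz=16000, max_lag=5)
print(lag, delay)
```

Multiplying the delay by the speed of sound (roughly 343 m/s in room-temperature air) gives the difference in path length from the talker to the two mics, and several such differences from different mic pairs constrain the talker's position.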

Speaker turn pattern segmentation

An interesting higher-level attribute of what happens in a meeting can be the pattern of who is speaking when. I've done a little bit of work trying to segment meetings into a few coarse segments, where the people speaking within each segment are relatively consistent. You can read about that work in the second half of this paper, which was published at ICASSP in 2003. There's a lot of further work to be done, for instance to investigate the effect of varying some of the model parameters, and also in trying to characterize the `intuitive' meaning of the segments we are retrieving, since they don't seem to correspond very well to hand-marked topic boundaries.

Dan Ellis
Last updated: Thu Jan 22 10:25:50 EST 2009