Department of Electrical Engineering - Columbia University

ELEN E6820 - Spring 2009

SPEECH AND AUDIO PROCESSING AND RECOGNITION

Class projects

2009-01-13: Here is the list of new project suggestions, some from this year, and some from earlier.

2009-01-13: Here are some nice example projects from the Spring 2008 semester class.

General description

A major part of the class will be the projects undertaken by students in some area of speech and audio processing and recognition. You are encouraged to be thinking about your project from the earliest possible date, and to discuss ideas with me in order to develop the best project plans. Various resources, such as corpora of sound files or access to existing software tools, will be provided where possible.

Each project will culminate in a presentation, either in person in the final classes, or via your web pages (or both). It is expected that on-campus students will make in-person presentations. The final two classes of the semester (2009-04-28 and 2009-04-30) will consist of these presentations. You will also make a short project proposal presentation just before spring break (2009-03-10 and 2009-03-12), which will be the basis of your midterm grade -- more details below. Don't worry about making the presentations terribly formal or polished; think of them rather as an opportunity to explain to the rest of the class some key ideas leading to or learned from the project. The emphasis is on communication and sharing ideas and knowledge. I particularly encourage demonstrations and sound examples.

Each project will also be described by a short written report detailing the work undertaken. This can be in the form of a single document (printed or online), or as a set of web pages (which has the advantage of supporting linked examples). In both cases, the report should follow the broad format of a research publication, with an introduction describing the problem, a description of the approach, a presentation of the results, perhaps a discussion, and final conclusions. A written report will typically be 5-10 pages in length, including figures. Project reports should be handed in (or posted to the web) by one week after the final presentations, i.e. by Thursday 2009-05-07.

Project scope and assessment

To give you some idea of the amount of work expected in the project, bear in mind that it accounts for half of a 4.5 credit class, which should work out to roughly a day a week on average over the semester. At the same time, the best projects have simple and clear concepts at their core, rather than ballooning into vast investigations. Here's a recipe for one possible project 'shape':

Identify the area of the investigation.
Define a specific, concrete task within that area. For instance, in a sound classification project, this would be the set of target classes into which classification will be performed and a corpus that will be used for testing. In other cases, the goal might be more open-ended (for instance, making a recorded male speaker sound female), but should still be explicit, concrete, and have an identified target domain.
Define evaluation metrics. This is a crucial habit for quality research. Without some kind of measure of how you are getting on, it's too easy to get lost working on a problem without making real progress. For classification problems, evaluation is easily achieved by measuring error rates on a given test set. Other projects may require more thought to come up with suitable evaluations. For the voice transformation example, what really matters is the subjective result ("does it sound male or female?"). Subjective testing is the only real way to assess this work, but listener judgments are cumbersome to collect, so some other measure (e.g. signal-level distance from a prototype female utterance?) might be more practical.
Identify the particular approach you intend to use to solve your chosen problem - the kinds of features you plan to extract, the basic signal processing sequence, etc.
Make an implementation (and debug it!).
Measure its performance with your evaluation metrics. Also, make a qualitative investigation into its shortcomings. What are the aspects of its behavior that differ from what you hoped or intended? How might they be improved?
Based on this analysis, modify your implementation (or make a new implementation) in order to address some of the shortcomings.
Assess this new iteration and compare it to the original. Were you able to improve things relative to the first attempt? How has the pattern of performance changed?
If you haven't run out of time, you can repeat this cycle indefinitely.
Finally, step back and look at the whole path you've come down. If you were going to start the project over again, how would you do it? What have you learned about the nature of the problem in the course of your investigation? What are the most promising avenues for future work?

That recipe is sufficiently vague to cover a wide range of projects, but even so it certainly isn't the only way to do things. However, the emphasis on well-defined goals and evaluation standards (so you can be clear about what's relevant and what isn't), and the idea of iterating over an implementation in light of performance analysis, are aspects I consider very valuable.
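As a rough illustration of the evaluate-then-iterate idea for a classification project, the sketch below computes the error rate on a test set and counts which classes get confused with which, for two successive versions of a system. The class names, the labels, and the function names `error_rate` and `confusions` are all hypothetical, chosen only for this example; a real project would substitute its own test corpus and classifier output.

```python
from collections import Counter


def error_rate(reference, predicted):
    """Fraction of test items whose predicted label differs from the reference."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must be the same length")
    if not reference:
        raise ValueError("test set is empty")
    errors = sum(r != p for r, p in zip(reference, predicted))
    return errors / len(reference)


def confusions(reference, predicted):
    """Count (true label, predicted label) pairs for the misclassified items."""
    return Counter((r, p) for r, p in zip(reference, predicted) if r != p)


# Hypothetical labels for a six-item test set and two system iterations.
reference = ["alarm", "speech", "music", "alarm", "speech", "music"]
first_try = ["alarm", "music", "music", "speech", "speech", "speech"]
second_try = ["alarm", "speech", "music", "speech", "speech", "music"]

for name, predicted in [("first", first_try), ("second", second_try)]:
    rate = error_rate(reference, predicted)
    print(name, round(rate, 3), dict(confusions(reference, predicted)))
```

The single error-rate number says whether the second iteration improved on the first; the confusion counts show how the pattern of errors changed, which is often the more useful guide to what to try next.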

The projects will be graded on several dimensions:

Project structure: How well the basic investigation is defined, how systematically it is pursued, how well the effort invested was balanced between different areas.
Technical content: The breadth and depth of understanding of audio processing-related ideas displayed within the project.
Presentation: How well the ideas and results of the project are communicated.

Finally, conciseness is always a virtue, particularly in the eyes of the reader. There is a fine art in editing down reports and presentations (and lectures!) to contain only the important points and nothing extraneous, while still presenting enough to make the material intelligible. As always, blindly generating vast volumes of results is a big warning sign that you should step back and refocus on your objectives.

Project proposal presentations

All the on-campus students will make a brief oral presentation of their project plans in the class meetings of 2009-03-10 and 2009-03-12 (directly before spring break). Each presentation is limited to 5-7 minutes. The goal is to explain the general idea of the topic you are addressing, what experiments you will perform, and how you will assess the results. These presentations are assessed by the rest of the class as the 'midterm' component of the grade.

Some project ideas

This list is offered to stimulate ideas, rather than to define some limited domain; interesting ideas that fall outside the categories below are also encouraged.

Speech recognition variants: There are several speech recognition frameworks available that can be used to build a working speech recognizer in a relatively modest amount of time. Access to existing implementations and corpora will allow students to focus on modifying a certain aspect of these highly complex systems (such as feature representations, model structures, or training procedure) and make quantitative measures of the impact on recognizer performance.
Audio compression variants: Different ideas for audio signal compression can be investigated either by starting from scratch or by modifying one of the packages available in source code. Bitrate reductions can be measured, although quality judgments are harder to obtain.
Nonspeech signal recognition: Many of the techniques used in speech recognition are in fact applicable over a much wider domain. Speech is only one kind of complex sound, of course; recognizers could be built for alarm sounds, particular acoustic events in movie soundtracks, animal calls etc. Suitable corpora and well-defined experiments can accurately measure the performance of such systems.
Speaker identification and characterization: Speech recognition has focused on the lexical content of speech (i.e. the words) and worked quite hard to exclude other aspects of the signal. Yet when we listen to speech, we infer considerable information about the speaker, such as gender, age, country of origin etc. All this information should be present in the signal; it is simply a matter of finding the right features and training the right recognizer.
Spatial location analysis and synthesis: Auditory spatial perception is a favorite topic of psychoacoustics research, and many models have been proposed of how the brain recovers spatial information (azimuth, elevation and range) from the signals at the two ears. These models can be used both to attempt the recognition of a given sound's origin, and to synthesize sounds that appear to come from a specific point in space.
Prosody detection: Prosody refers to variable aspects of the speech signal apart from those defining the phonetic content; these include pitch (melody), timing, stress etc. The focus on speech transcription has left these aspects of the signal relatively neglected, yet they are certainly informative, particularly if we wish to understand more than simply the word sequence. This project would investigate extracting reliable correlates of such features from speech signals.
Music synthesis: A huge range of algorithms have been used in computer and electronic music; some of these could be investigated, compared, and perhaps extended.
Music analysis: Automatic transcription of recorded music is still a major challenge, even for relatively constrained subsets, but there are other kinds of information, such as rhythm, genre, instruments and perhaps chord progressions or bass lines, that can be more successfully identified.
Audio and music retrieval: It's not at all obvious how to define 'similarity' between two sounds, be they one-second sound effects or one-hour orchestral recordings. But if we could, we might be able to build a useful analog of a search engine working purely on sound. Several groups have tried; their approaches could be examined, or a new approach could be developed.
Temporal structure recovery: If you listen to just the soundtrack of a movie, you can probably get a pretty good idea of what's going on. Even if you don't understand the words, you may still recognize the sound effects, or respond to the soundtrack music. What useful coarse-time structural information can we recover simply by processing the sound channel of multimedia content?
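For the audio compression idea in the list above, the "bitrate reduction" half of the evaluation is simple arithmetic. The following is a minimal sketch, assuming the coded audio is available as a byte count with a known duration and that the reference is uncompressed linear PCM; the function names and the example figures are hypothetical and not taken from any particular codec.

```python
def bitrate_kbps(num_bytes, duration_s):
    """Average bitrate in kilobits per second (1 kbit = 1000 bits)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return num_bytes * 8 / duration_s / 1000


def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bitrate of uncompressed linear PCM audio, in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# Hypothetical example: a 10-second clip coded into 160,000 bytes,
# compared against 16-bit stereo PCM sampled at 44.1 kHz.
coded = bitrate_kbps(160_000, 10.0)
original = pcm_bitrate_kbps(44_100, 16, 2)
print(f"coded: {coded} kbps, original: {original} kbps")
print(f"compression ratio: {original / coded:.2f}:1")
```

This covers only the bitrate side; as noted above, judging the quality of the coded audio is the harder part and needs a separate measure.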

Dan Ellis <[email protected]>
Last updated: Thu Jan 22 10:26:30 EST 2009