SPEECH AND AUDIO PROCESSING AND RECOGNITION

ELEN E6820 - Spring 2001

Class projects

2001-06-12

The Spring 2001 semester is finished, and I was very pleased with all the projects that were completed by the class participants. As an illustrative record, here is a selection of the project reports:

Detecting sound events in basketball video archive by Dongqing Zhang
Content based audio retrieval based on hidden Markov models by Manuel Reyes
The S-Matrix: A novel approach to music segment detection by Robert Turetsky

The original content of this page is below.

Updated 2001-04-17 dpwe

A major part of the class will be the projects undertaken by students in some area of speech and audio processing and recognition. You are encouraged to start thinking about your projects from the earliest possible date, and to discuss ideas with me in order to develop the best project plans. Various resources, such as corpora of sound files or access to existing software tools, will be provided where possible.

Each project will culminate in a presentation, either in person in the final class or via your web pages (or both). It is expected that on-campus students will make in-person presentations. The final class of the semester (April 24th) will consist of these presentations. Don't worry about making the presentations terribly formal or polished, but think of them as an opportunity to explain to the rest of the class some of the key points that you learned from doing the project. I particularly encourage demonstrations and sound examples.

Each project will also be described by a short written report detailing the work undertaken. This can be in the form of a printed document, or as a set of web pages (which has the advantage of supporting linked examples). In both cases, the report should follow the format of a research publication, with an introduction describing the problem, a description of the approach, a presentation of the results, perhaps a discussion, and final conclusions. A written report will typically be 5-10 pages in length, including figures. Project reports should be handed in (or posted to the web) by the last day of classes, Monday April 30th.

Project scope and assessment

To give you some idea of the amount of work expected in the project, bear in mind that it accounts for 40% of a 4.5 credit class, which should work out to something like an average of a day a week over the semester. At the same time, the best projects have simple and clear concepts at their core, rather than ballooning into vast investigations. Here's a recipe for one possible project 'shape':

Identify the area of the investigation.
Define a specific, concrete task within that area. For instance, in a sound classification project, this would be the set of target classes into which classification will be performed and a corpus that will be used for testing. In other cases, the goal might be more open-ended (for instance, making a recorded male speaker sound female), but should still be explicit, concrete, and have an identified target domain.
Define evaluation metrics. This is a crucial habit for quality research. Without some kind of measure of how you are getting on, it's too easy to get lost working on a problem without making real progress. For classification problems, evaluation is easily achieved by measuring error rates on a given test set. Other projects may require more thought to come up with suitable evaluations. For the voice transformation example, it is the subjective success ("does it sound like a male or female?") that really matters, but subjective impressions are hard to collect, so some other measure (signal-level distance from a prototype female utterance?) might be more useful.
Identify the particular approach you intend to use to solve your chosen problem - the kinds of features you plan to extract, the basic signal processing sequence, etc.
Make an implementation (and debug it!).
Measure its performance with your evaluation metrics. Also, make a qualitative investigation into its shortcomings. What are the aspects of its behavior that differ from what you hoped or intended? How might they be improved?
Based on this analysis, modify your implementation (or make a new implementation) in order to address some of the shortcomings.
Assess this new iteration, compare it to the original. Were you able to improve things relative to the first attempt? How has the pattern of performance changed?
If you haven't run out of time, you can repeat this cycle indefinitely.
Finally, step back and look at the whole path you've come down. If you were going to start the project over again, how would you do it? What have you learned about the nature of the problem in the course of your investigation? What are the most promising avenues for future work?
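As a rough illustration of the evaluation step for a classification project, the sketch below measures error rate on a fixed test set and tallies which classes get confused, so that two iterations of a system can be compared. The function names, class labels, and label lists are all hypothetical, made up for illustration; they are not results from any actual project.

```python
from collections import Counter


def error_rate(reference, hypothesis):
    """Fraction of test items whose predicted label differs from the true label."""
    if len(reference) != len(hypothesis):
        raise ValueError("label lists must be the same length")
    if not reference:
        raise ValueError("test set is empty")
    errors = sum(1 for r, h in zip(reference, hypothesis) if r != h)
    return errors / len(reference)


def confusions(reference, hypothesis):
    """Count (true label, predicted label) pairs for misclassified items only."""
    return Counter((r, h) for r, h in zip(reference, hypothesis) if r != h)


# Hypothetical labels for an eight-clip test set (illustrative only).
truth    = ["alarm", "speech", "music", "alarm",  "speech", "music", "alarm", "speech"]
baseline = ["alarm", "music",  "music", "speech", "speech", "music", "alarm", "music"]
revised  = ["alarm", "speech", "music", "speech", "speech", "music", "alarm", "music"]

# Score both iterations on the same test set so the numbers are comparable.
for name, predicted in [("baseline", baseline), ("revised", revised)]:
    print(name, error_rate(truth, predicted),
          confusions(truth, predicted).most_common())
```

The error rate gives the single number to track across iterations, while the confusion counts support the qualitative look at shortcomings: they show which kinds of mistakes a revision removed and which remain.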

That recipe is sufficiently vague to cover a wide range of projects, but even so it isn't definitive. However, the emphasis on well-defined goals and evaluation standards (so you can be clear about what's relevant and what isn't), and the idea of iterating over an implementation in light of performance analysis, are aspects I consider very valuable.

The projects will be graded on several dimensions:

Project structure: How well the basic investigation is defined, how systematically it is pursued, how well the effort invested was balanced between different areas.
Technical content: The breadth and depth of understanding of audio processing-related ideas displayed within the project.
Presentation: How well the ideas and results of the project are communicated.

Finally, conciseness is always a virtue, particularly in the eyes of the reader. There is a fine art in editing down reports and presentations (and lectures!) to contain only the important points and nothing extraneous, while still presenting enough to make the material intelligible. As always, blindly generating vast volumes of results is a big warning sign that you should step back and refocus on your objectives.

Some project ideas

This list is offered to stimulate ideas, rather than to define some limited domain; interesting ideas that fall outside the categories below are also encouraged.

Speech recognition variants: There are several speech recognition frameworks available that can be used to build a working speech recognizer in a relatively modest amount of time. Access to existing implementations and corpora will allow students to focus on modifying a certain aspect of these highly complex systems (such as feature representations, model structures, or training procedure) and make quantitative measures of the impact on recognizer performance.
Audio compression variants: Different ideas for audio signal compression can be investigated either by starting from scratch or by modifying one of the packages available in source code. Bitrate reductions can be measured, although quality judgments are harder to obtain.
Nonspeech signal recognition: Many of the techniques used in speech recognition are in fact applicable over a much wider domain. Speech is only one kind of complex sound, of course; recognizers could be built for alarm sounds, particular acoustic events in movie soundtracks, animal calls etc. Suitable corpora and well-defined experiments can accurately measure the performance of such systems.
Speaker identification and characterization: Speech recognition has focussed on the lexical content of speech (i.e. the words) and worked quite hard to exclude other aspects of the signal. Yet when we listen to speech, we infer considerable information about the speaker, such as gender, age, country of origin etc. All this information should be present in the signal; it is simply a matter of finding the right features and training the right recognizer.
Spatial location analysis and synthesis: Auditory spatial perception is a favorite topic of psychoacoustics research, and many models have been proposed of how the brain recovers spatial information (azimuth, elevation and range) from the signals at the two ears. These models can be used both to attempt the recognition of a given sound's origin, and to synthesize sounds that appear to come from a specific point in space.
Prosody detection: Prosody refers to variable aspects of the speech signal apart from those defining the phonetic content; these include pitch (melody), timing, stress etc. The focus on speech transcription has left these aspects of the signal relatively neglected, yet they are certainly informative, particularly if we wish to understand more than simply the word sequence. This project would investigate extracting reliable correlates of such features from speech signals.
Music synthesis: A huge range of algorithms have been used in computer and electronic music; some of these could be investigated, compared, and perhaps extended.
Music analysis: The 'holy grail' of automatic transcription of recorded music still seems quite distant, even for relatively constrained subsets, but there are other kinds of information, such as rhythm, genre, instruments and perhaps chord progressions or bass lines that can be more successfully identified.
Audio and music retrieval: It's not at all obvious how to define 'similarity' between two sounds, be they one-second sound effects or one-hour orchestral recordings. But if we could, we might be able to build a useful analog of a search engine working purely on sound. Several groups have tried; their approaches could be examined, or a new approach could be developed.
Temporal structure recovery: If you listen to just the soundtrack of a movie, you can probably get a pretty good idea of what's going on. Even if you don't understand the words, you may still recognize the sound effects, or respond to the soundtrack music. What useful coarse-time structural information can we recover simply by processing the sound channel of multimedia content?
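To make one of these ideas concrete, the spatial location item could start from something as small as estimating the time delay between the signals at the two ears, which is one cue to azimuth. The sketch below does this by brute-force cross-correlation. It is only a minimal illustration under stated assumptions: the function name, the test signal, and the search range are hypothetical, and this is not one of the perceptual models mentioned above.

```python
def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) of `right` relative to `left` that maximizes
    their cross-correlation, searching lags from -max_lag to +max_lag."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, sample in enumerate(left):
            j = i + lag
            if 0 <= j < len(right):  # ignore samples shifted outside the signal
                score += sample * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical test signal: a short burst surrounded by silence.
left = [0, 0, 0, 1, 3, -2, 4, -1, 0, 0, 0, 0, 0, 0, 0, 0]
# Simulate the sound reaching the right ear 3 samples later.
right = [0, 0, 0] + left[:-3]

print(estimate_delay(left, right, max_lag=5))
```

Dividing the returned lag by the sampling rate converts it to a delay in seconds. A synthetic pair like this one, where the true delay is known, also gives a ready-made evaluation: the estimate either matches the delay that was inserted or it does not.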

Dan Ellis
Last updated: Tue Jun 12 13:29:14 EDT 2001