Department of Electrical Engineering
- Columbia University
SPEECH AND AUDIO PROCESSING AND RECOGNITION
ELEN E6820 - Spring 2001
Class projects
2001-06-12
The Spring 2001 semester is finished, and I was very pleased with all the
projects that were completed by the class participants. As an illustrative
record, here is a selection of the project reports:
The original content of this page is below.
Updated 2001-04-17 dpwe
A major part of the class will be the projects undertaken by students
in some area of speech and audio processing and recognition. You are
encouraged to be thinking about your projects from the earliest possible
date, and to discuss ideas with me in order to develop the best
project plans. Various resources, such as corpora of sound files or access
to existing software tools, will be provided where possible.
Each project will culminate in a presentation, either in person in the
final class or via your web pages (or both). It is expected that on-campus
students will make in-person presentations. The final class of the semester
(April 24th) will consist of these presentations.
Don't worry about making the presentations terribly formal or polished,
but think of them as an opportunity to explain to the rest of the class
some of the key points that you learned from doing the project.
I particularly encourage demonstrations and sound examples.
Each project will also be described by a short written
report detailing the work undertaken. This can be in the form of a
printed document, or as a set of web pages (which has the advantage
of supporting linked examples). In both cases, the report should
follow the format of a research publication, with an introduction
describing the problem, a description of the approach, a presentation
of the results, perhaps a discussion, and final conclusions.
A written report
will typically be 5-10 pages in length, including figures.
Project reports should be handed in (or posted to the web) by
the last day of classes, Monday April 30th.
Project scope and assessment
To give you some idea of the amount of work expected in the project,
bear in mind that it accounts for 40% of a 4.5 credit class, which
should work out to something like an average of a day a week over the
semester. At the same time, the best projects have simple and clear
concepts at their core, rather than ballooning into vast investigations.
Here's a recipe for one possible project 'shape':
- Identify the area of the investigation.
- Define a specific, concrete task within that area. For instance, in a sound classification project, this would be the set of target classes into which classification will be performed and a corpus that will be used for testing. In other cases, the goal might be more open-ended (for instance, making a recorded male speaker sound female), but should still be explicit, concrete, and have an identified target domain.
- Define evaluation metrics. This is a crucial habit for quality research. Without some kind of measure of how you are getting on, it's too easy to get lost working on a problem without making real progress. For classification problems, evaluation is easily achieved by measuring error rates on a given test set. Other projects may require more thought to come up with suitable evaluations. For the voice transformation example, it is the subjective success ("does it sound like a male or a female speaker?") that really matters, but subjective impressions are hard to collect, so some other measure (signal-level distance from a prototype female utterance?) might be more useful.
- Identify the particular approach you intend to use to solve your chosen problem - the kinds of features you plan to extract, the basic signal processing sequence, etc.
- Make an implementation (and debug it!).
- Measure its performance with your evaluation metrics. Also, make a qualitative investigation into its shortcomings. What are the aspects of its behavior that differ from what you hoped or intended? How might they be improved?
- Based on this analysis, modify your implementation (or make a new implementation) in order to address some of the shortcomings.
- Assess this new iteration, compare it to the original. Were you able to improve things relative to the first attempt? How has the pattern of performance changed?
- If you haven't run out of time, you can repeat this cycle indefinitely.
- Finally, step back and look at the whole path you've come down. If you were going to start the project over again, how would you do it? What have you learned about the nature of the problem in the course of your investigation? What are the most promising avenues for future work?
That recipe is sufficiently vague to cover a wide range of projects, but even so it isn't definitive. However, the emphasis on well-defined goals and evaluation standards (so you can be clear about what's relevant and what isn't), and the idea of iterating over an implementation in light of performance analysis, are aspects I consider very valuable.
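As a rough illustration of the evaluation step for a classification project, here is a minimal Python sketch that computes the error rate on a test set and tallies which confusions occur. The class names, the label lists, and the function names `error_rate` and `confusions` are all hypothetical, chosen only for this example; a real project would substitute its own corpus labels and classifier output.

```python
from collections import Counter


def error_rate(reference, predicted):
    """Fraction of test items whose predicted class differs from the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("test set is empty")
    errors = sum(1 for r, p in zip(reference, predicted) if r != p)
    return errors / len(reference)


def confusions(reference, predicted):
    """Count (reference, predicted) label pairs for the misclassified items only."""
    return Counter((r, p) for r, p in zip(reference, predicted) if r != p)


# Hypothetical labels for a five-item test set (not real results).
reference = ["speech", "music", "alarm", "speech", "music"]
baseline = ["speech", "speech", "alarm", "speech", "alarm"]  # first implementation
revised = ["speech", "music", "alarm", "speech", "alarm"]    # after one iteration

print(error_rate(reference, baseline))   # 0.4
print(error_rate(reference, revised))    # 0.2
print(confusions(reference, revised).most_common())
```

Scoring both iterations against the same fixed test set is what makes the comparison meaningful, and the confusion counts give a starting point for the qualitative look at shortcomings.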
The projects will be graded on several dimensions:
- Project structure: How well the basic investigation is defined,
how systematically it is pursued, how well the effort invested
was balanced between different areas.
- Technical content: The breadth and depth of understanding of
audio processing-related ideas displayed within the project.
- Presentation: How well the ideas and results of the project are
communicated.
Finally, conciseness is always a virtue, particularly in the eyes of
the reader. There is a fine art in editing down reports and presentations
(and lectures!) to contain only the important points and nothing extraneous,
while still presenting enough to make the material intelligible. As always,
blindly generating vast volumes of results is a big warning sign that you
should step back and refocus on your objectives.
Some project ideas
This list is offered to stimulate ideas, rather than to define some
limited domain; interesting ideas that fall outside the categories
below are also encouraged.
- Speech recognition variants: There are several speech recognition
frameworks available that can be used to build a working speech recognizer
in a relatively modest amount of time. Access to existing implementations
and corpora will allow students to focus on modifying a certain aspect
of these highly complex systems (such as feature representations, model
structures, or training procedure) and make quantitative measures of the
impact on recognizer performance.
- Audio compression variants: Different ideas for audio signal
compression can be investigated either by starting from scratch or by modifying
one of the packages available in source code. Bitrate reductions can be
measured, although quality judgments are harder to obtain.
- Nonspeech signal recognition: Many of the techniques used in
speech recognition are in fact applicable over a much wider domain. Speech
is only one kind of complex sound, of course; recognizers could be built
for alarm sounds, particular acoustic events in movie soundtracks, animal
calls etc. Suitable corpora and well-defined experiments can accurately
measure the performance of such systems.
- Speaker identification and characterization: Speech recognition
has focussed on the lexical content of speech (i.e. the words) and worked
quite hard to exclude other aspects of the signal. Yet when we listen to
speech, we infer considerable information about the speaker, such as gender,
age, country of origin etc. All this information should be present in the
signal; it is simply a matter of finding the right features and training
the right recognizer.
- Spatial location analysis and synthesis: Auditory spatial perception
is a favorite topic of psychoacoustics research, and many models have been
proposed of how the brain recovers spatial information (azimuth, elevation
and range) from the signals at the two ears. These models can be used both
to attempt the recognition of a given sound's origin, and to synthesize
sounds that appear to come from a specific point in space.
- Prosody detection: Prosody refers to variable aspects of the
speech signal apart from those defining the phonetic content; these include
pitch (melody), timing, stress etc. The focus on speech transcription has
left these aspects of the signal relatively neglected, yet they are certainly
informative, particularly if we wish to understand more than simply the
word sequence. This project would investigate extracting reliable correlates
of such features from speech signals.
- Music synthesis: A huge range of algorithms have been used in
computer and electronic music; some of these could be investigated, compared,
and perhaps extended.
- Music analysis: The 'holy grail' of automatic transcription
of recorded music still seems quite distant, even for relatively constrained
subsets, but there are other kinds of information, such as rhythm, genre,
instruments and perhaps chord progressions or bass lines that can be more
successfully identified.
- Audio and music retrieval: It's not at all obvious how to define
'similarity' between two sounds, be they one-second sound effects or one-hour
orchestral recordings. But if we could, we might be able to build a useful
analog of a search engine working purely on sound. Several groups have
tried; their approaches could be examined, or a new approach could be developed.
- Temporal structure recovery: If you listen to just the soundtrack
of a movie, you can probably get a pretty good idea of what's going on.
Even if you don't understand the words, you may still recognize the sound
effects, or respond to the soundtrack music. What useful coarse-time structural
information can we recover simply by processing the sound channel of multimedia
content?
Dan Ellis
<[email protected]>
Last updated: Tue Jun 12 13:29:14 EDT 2001