Student Research Projects
I am looking for some master's degree students to work on the following
research projects for the Fall 2000 semester. If you are interested, please
contact me, including a brief
resume of your relevant experience. These projects will most likely involve
programming in C/C++ on a Linux platform, so experience in this domain is
desirable.
Thanks - Dan Ellis <email@example.com>
The Meeting Recorder Project
This is a large project that I helped set up at my previous position
with the International Computer Science Institute (ICSI) in Berkeley CA.
We are interested in the problem of automatic processing of audio recorded
during conventional meetings. Currently, we have collected a few hours of
recordings, which we are having transcribed. The audio is recorded simultaneously
on 16 channels, both on head-mounted microphones and by microphones placed
on the conference table. The eventual goal is to develop useful speech recognition
from desktop microphones, and to use this for summarization and retrieval
of recorded information, but there are many stages along this path. Full
details of the meeting recorder data capture setup are available on my ICSI Meeting
Recorder web site.
There are several projects I would like to pursue based on analysis of
the existing recordings, including:
- Speaker turn detection: Meetings involve many different speakers,
and an often rapid alternation between speakers, including periods of overlap.
We have recordings from head-mounted microphones for each speaker, which
should at least allow us to establish the 'ground truth' of who is speaking
when, but we ideally want to be able to detect speaker changes and overlaps
based on the desktop mic channels (either from a single channel, or using
multichannel information). This project will involve establishing the ground
truth speaker turns from the multichannel recordings, then attempting speaker
segmentation (and identification) using standard change-detection algorithms
(such as the ones described in Ferreiros
& Ellis, ICSLP 2000).
- Speech feature development for distant microphones: For the
eventual (difficult) goal of high-quality word recognition from the desktop
microphones, we will probably need to develop novel feature representations
better able to handle the degradation due to background noise and reverberation
present in these signals. This project will experiment with feeding both
close-talking and distant-talking recordings into an existing speech recognition
system (trained on the DARPA Broadcast News corpus), and will investigate
the effect of various robustness transformations aimed at improving the
performance of the distant speech (such as the Modulation Spectrogram and
the Tandem approaches described by Hermansky,
Ellis & Sharma at ICASSP-2000).
- Nonlexical event recognition: Although the primary focus of
the project is on speech recognition for meetings, there is a great deal
of other information present in the recordings that might be useful in
an eventual searching application. For instance, apart from the actual
words, recordings of speech contain information about the style of the
dialog (whether people are speaking formally or informally, the distribution
of pauses etc.) as well as many non-word events such as laughter. This
project will involve first the definition and hand-labelling of various
nonlexical events within the data, then the development of suitable feature
representations and classifiers to detect these events.
- Browsing tools: As we collect more and more data, including
automatically-derived labels such as word alignments, we will need increasingly
sophisticated tools for browsing and investigating the data. This project
would be appropriate for a student with experience in graphical user interface
programming and who would be motivated to develop novel representations
and interaction methods for a large database consisting of speech, text
and other derived features and attributes.
Tandem Acoustic Modeling for Automatic Speech Recognition
This is a new approach to modeling the speech signal that combines the
standard Gaussian Mixture Model / Hidden Markov Model (GMM/HMM) approach
with the more unusual connectionist (neural network) approach. Our experiments
last year in a connected digits task achieved an improvement of more than
50% in word error rate over a standard baseline system, as reported by Hermansky,
Ellis & Sharma at ICASSP-2000 (see also these slides from a recent talk on the Tandem approach in PDF format). However, that system is really just
a first pass; we would like to investigate variations to determine which
parts are most important and how they can be improved. Projects in this
general area include:
- Development of alternative training targets: The neural network
in the Tandem system is trained to the standard phone targets used in a
connectionist-only recognizer. However, there is no a priori reason to
assume that these are the optimal targets for the Tandem structure. This
project will involve bringing up the Tandem training system in our lab,
and experimenting with various alternative training targets such as those
derived from HMM model states or articulatory descriptions of the words.
- Sensitivity of the Tandem architecture: Our existing system
has a large number of parameters, such as the neural net size and structure,
and various aspects of the HMM/GMM system, that have not been investigated.
This project will involve systematic investigation of the dependence of
the overall system performance on parameters such as the neural net layer
sizes, the size of the feature vector passed to the HMM/GMM system, and
attributes of the HMM/GMM system such as the use of delta-features and
perhaps augmentation with baseline signal features, or other standard HMM/GMM
techniques.
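The Tandem front end described above can be sketched as follows: neural-net phone posteriors are post-processed (log compression, then decorrelation by PCA, as in the ICASSP-2000 paper) into feature vectors for a conventional GMM/HMM back end. This Python/NumPy illustration assumes the posteriors are simply given as an array; in the real system they come from a trained network.

```python
import numpy as np

def tandem_features(posteriors, n_keep=8):
    """Turn per-frame phone posteriors (n_frames, n_phones) into
    decorrelated features for a GMM/HMM back end: log, then PCA."""
    logp = np.log(posteriors + 1e-8)      # log compresses the skewed posteriors
    x = logp - logp.mean(axis=0)          # zero-mean per dimension
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    basis = evecs[:, ::-1][:, :n_keep]    # keep the top-n_keep principal directions
    return x @ basis                      # projected, decorrelated features
```

The "alternative training targets" and "sensitivity" projects amount to varying what produces the `posteriors` array and the dimensions kept here, then measuring the effect on word error rate.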
Recognition based on partial information
I am involved in the European project
RESPITE which is concerned with speech recognition when
some of the underlying data is missing or obscured. Within this project,
there are a number of ongoing open research areas:
- Features for missing data recognition: Our partners at Sheffield
University have pioneered the use of missing-feature techniques to
recognize speech even when certain frequency bands are not available.
This project would involve reproducing their results (they make some
of their software available), then investigating the effect of using
different feature representations, e.g. by filtering conventional
spectral features along the time and frequency axes.
- Tagging speech and background: The missing-data approach
relies on some other mechanism to indicate which features are
reliable and which should be considered obscured. So far, we have
used relatively crude signal-to-noise ratio estimates for this.
A more sophisticated approach would be to use some local property
of the signal, such as periodicity, as the basis for these masks.
This project would adapt the weft representation
to provide masks for a missing-data recognizer.
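A minimal sketch of the two ideas above, assuming diagonal-covariance Gaussian state models (function names, array shapes and the 3 dB threshold are illustrative assumptions): for a diagonal Gaussian, marginalizing out the unreliable dimensions just means dropping their terms from the log-likelihood sum, and a crude SNR-based mask decides which dimensions those are.

```python
import numpy as np

def masked_loglik(frame, mask, mean, var):
    """Log-likelihood of one spectral frame under a diagonal Gaussian,
    marginalizing over (i.e. simply omitting) the unreliable dimensions.

    frame, mean, var: (n_bands,) arrays; mask: boolean, True = reliable.
    """
    d = frame[mask] - mean[mask]
    v = var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)

def snr_mask(noisy, noise_floor, threshold_db=3.0):
    """Crude SNR mask: a band is reliable when its level exceeds the
    estimated noise floor by threshold_db."""
    snr_db = 10 * np.log10((noisy + 1e-12) / (noise_floor + 1e-12))
    return snr_db > threshold_db
```

The second project above would replace `snr_mask` with something based on a local signal property such as periodicity, e.g. masks derived from the weft representation.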
Unsupervised learning of audio signals
One of the main research themes of LabROSA is automatic extraction of
audio content structure for use in indexing and retrieval. The ideal is
to simulate the skills of a human 'librarian' who will preview a large
archive of multimedia material, figure out the significant, recurrent
content, and build an appropriate index.
An important step towards this goal would be the development of algorithms
that can recognize recurrent patterns or structures in large audio
databases without any manual input or labels - i.e. via unsupervised
learning. There are several threads I would like to pursue:
- Unsupervised HMM modeling: HMMs are typically used in
speech recognition to model words in terms of specific, predefined
pronunciations, i.e. 'supervised' training. They can, however, also
be applied without any constraints on the content, based only on
implicit clustering. This project involves modeling speech and
other signals via HMMs without label constraints to see what kind
of patterns will emerge.
- Acoustic event detection and classification: The auditory
system is particularly sensitive to sudden changes - particularly
increases in energy - because these are very often indicative of
significant events in the environment. A starting point for defining
classes of events in audio archives would be to build an 'onset
detector' that lays down a set of landmark times throughout the
archives, then to extract features and perform clustering on these
different discrete events. This project will pursue this idea on
an archive of recorded broadcasts.
- Detection of exact repeats: Broadcasts in particular often
contain segments that are repeated verbatim at different times.
Examples include theme music, news clips and of course commercials.
It should be fairly easy to detect these repetitions, although
the problem can rapidly become intractable for a large archive.
This project will examine several different approaches to detecting
these repeated episodes, as well as looking at the robustness
(misses versus false alarm performance) of the techniques.
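One simple approach to the exact-repeat problem is to hash coarsely quantized windows of feature frames, so that the search stays roughly linear in archive length rather than quadratic (this sketch is one of the candidate approaches, not a settled design; the window width and quantization step are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

def find_repeats(features, width=10, quant=1.0):
    """Find groups of start times whose feature sequences match exactly
    after coarse quantization.

    features: (n_frames, n_dims) array. Each window of `width` frames
    is quantized and hashed; windows sharing a hash are candidate
    verbatim repeats (e.g. jingles, news clips, commercials).
    """
    q = np.round(features / quant).astype(np.int64)
    table = defaultdict(list)
    for t in range(len(q) - width + 1):
        key = q[t : t + width].tobytes()
        table[key].append(t)
    return [times for times in table.values() if len(times) > 1]
```

Coarse quantization trades misses against false alarms: a finer `quant` step rejects near-matches that are not true repeats but may miss repeats that differ slightly in level or alignment, which is exactly the robustness question the project would examine.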
Last updated: 2000/09/11
Dan Ellis <firstname.lastname@example.org>