Student Research Projects
I am looking for some master's degree students to work on the following
research projects for the Fall 2000 semester. If you are interested, please
contact me, including a brief
resume of your relevant experience. These projects will most likely involve
programming in C/C++ on a Linux platform, so experience in this domain is
desirable.
Thanks - Dan Ellis <email@example.com>
The Meeting Recorder Project
This is a large project that I helped set up at my previous position
with the International Computer Science Institute (ICSI) in Berkeley CA.
We are interested in the problem of automatic processing of audio recorded
during conventional meetings. Currently, we have collected a few hours of
recordings, which we are having transcribed. The audio is recorded simultaneously
on 16 channels, both on head-mounted microphones and by microphones placed
on the conference table. The eventual goal is to develop useful speech recognition
from desktop microphones, and to use this for summarization and retrieval
of recorded information, but there are many stages along this path. Full
details of the meeting recorder data capture setup are available on my ICSI Meeting
Recorder web site.
There are several projects I would like to pursue based on analysis of
the existing recordings, including:
- Speaker turn detection: Meetings involve many different speakers,
and an often rapid alternation between speakers, including periods of overlap.
We have recordings from head-mounted microphones for each speaker, which
should at least allow us to establish the 'ground truth' of who is speaking
when, but we ideally want to be able to detect speaker changes and overlaps
based on the desktop mic channels (either from a single channel, or using
multichannel information). This project will involve establishing the ground
truth speaker turns from the multichannel recordings, then attempting speaker
segmentation (and identification) using standard change-detection algorithms
(such as the ones described in Ferreiros
& Ellis, ICSLP 2000).
- Speech feature development for distant microphones: For the
eventual (difficult) goal of high-quality word recognition from the desktop
microphones, we will probably need to develop novel feature representations
better able to handle the degradation due to background noise and reverberation
present in these signals. This project will experiment with feeding both
close-talking and distant-talking recordings into an existing speech recognition
system (trained on the DARPA Broadcast News corpus), and will investigate
the effect of various robustness transformations aimed at improving the
performance of the distant speech (such as the Modulation Spectrogram and
the Tandem approaches described by Hermansky,
Ellis & Sharma at ICASSP-2000).
- Nonlexical event recognition: Although the primary focus of
the project is on speech recognition for meetings, there is a great deal
of other information present in the recordings that might be useful in
an eventual searching application. For instance, apart from the actual
words, recordings of speech contain information about the style of the
dialog (whether people are speaking formally or informally, the distribution
of pauses etc.) as well as many non-word events such as laughter. This
project will involve first the definition and hand-labelling of various
nonlexical events within the data, then the development of suitable feature
representations and classifiers to detect these events.
- Browsing tools: As we collect more and more data, including
automatically-derived labels such as word alignments, we will need increasingly
sophisticated tools for browsing and investigating the data. This project
would be appropriate for a student with experience in graphical user interface
programming and who would be motivated to develop novel representations
and interaction methods for a large database consisting of speech, text
and other derived features and attributes.
Tandem Acoustic Modeling for Automatic Speech Recognition
This is a new approach to modeling the speech signal that combines the
standard Gaussian Mixture Model / Hidden Markov Model (GMM/HMM) approach
with the more unusual connectionist (neural network) approach. Our experiments
last year in a connected digits task achieved an improvement of more than
50% in word error rate over a standard baseline system, as reported by Hermansky,
Ellis & Sharma at ICASSP-2000 (see also these slides from a recent talk on the Tandem approach in PDF format). However, that system is really just
a first pass; we would like to investigate variations to determine which
parts are most important and how they can be improved. Projects in this
general area include:
- Development of alternative training targets: The neural network
in the Tandem system is trained to the standard phone targets used in a
connectionist-only recognizer. However, there is no a priori reason to
assume that these are the optimal targets for the Tandem structure. This
project will involve bringing up the Tandem training system in our lab,
and experimenting with various alternative training targets such as those
derived from HMM model states or articulatory descriptions of the words.
- Sensitivity of the Tandem architecture: Our existing system
has a large number of parameters, such as the neural net size and structure,
and various aspects of the HMM/GMM system, that have not been investigated.
This project will involve systematic investigation of the dependence of
the overall system performance on parameters such as the neural net layer
sizes, the size of the feature vector passed to the HMM/GMM system, and
attributes of the HMM/GMM system such as the use of delta-features and
perhaps augmentation with baseline signal features, or other standard HMM/GMM
techniques.
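The Tandem front end described above can be sketched as follows: neural-net phone posteriors are post-processed (log compression, then decorrelation by PCA, as in the ICASSP-2000 paper) into feature vectors for a conventional GMM/HMM back end. This Python/NumPy illustration assumes the posteriors are simply given as an array; in the real system they come from a trained network.

```python
import numpy as np

def tandem_features(posteriors, n_keep=8):
    """Turn per-frame phone posteriors (n_frames, n_phones) into
    decorrelated features for a GMM/HMM back end: log, then PCA."""
    logp = np.log(posteriors + 1e-8)      # log compresses the skewed posteriors
    x = logp - logp.mean(axis=0)          # zero-mean per dimension
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    basis = evecs[:, ::-1][:, :n_keep]    # keep the top-n_keep principal directions
    return x @ basis                      # projected, decorrelated features
```

The "alternative training targets" and "sensitivity" projects amount to varying what produces the `posteriors` array and the dimensions kept here, then measuring the effect on word error rate.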
Recognition based on partial information
I am involved in the European project
RESPITE which is concerned with speech recognition when
some of the underlying data is missing or obscured. Within this project,
there are a number of ongoing open research areas:
- Features for missing data recognition: Our partners at Sheffield
University have pioneered the use of missing-feature techniques to
recognize speech even when certain frequency bands are not available.
This project would involve reproducing their results (they make some
of their software available), then investigating the effect of using
different feature representations, e.g. by filtering conventional
spectral features along the time and frequency axes.
- Tagging speech and background: The missing-data approach
relies on some other mechanism to indicate which features are
reliable and which should be considered obscured. So far, we have
used relatively crude signal-to-noise ratio estimates for this.
A more sophisticated approach would be to use some local property
of the signal, such as periodicity, as the basis for these masks.
This project would adapt the weft representation
to provide masks for a missing-data recognizer.
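A minimal sketch of the two ideas above, assuming diagonal-covariance Gaussian state models (function names, array shapes and the 3 dB threshold are illustrative assumptions): for a diagonal Gaussian, marginalizing out the unreliable dimensions just means dropping their terms from the log-likelihood sum, and a crude SNR-based mask decides which dimensions those are.

```python
import numpy as np

def masked_loglik(frame, mask, mean, var):
    """Log-likelihood of one spectral frame under a diagonal Gaussian,
    marginalizing over (i.e. simply omitting) the unreliable dimensions.

    frame, mean, var: (n_bands,) arrays; mask: boolean, True = reliable.
    """
    d = frame[mask] - mean[mask]
    v = var[mask]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)

def snr_mask(noisy, noise_floor, threshold_db=3.0):
    """Crude SNR mask: a band is reliable when its level exceeds the
    estimated noise floor by threshold_db."""
    snr_db = 10 * np.log10((noisy + 1e-12) / (noise_floor + 1e-12))
    return snr_db > threshold_db
```

The second project above would replace `snr_mask` with something based on a local signal property such as periodicity, e.g. masks derived from the weft representation.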
Unsupervised learning of audio signals
One of the main research themes of LabROSA is automatic extraction of
audio content structure for use in indexing and retrieval. The ideal is
to simulate the skills of a human 'librarian' who will preview a large
archive of multimedia material, figure out the significant, recurrent
content, and build an appropriate index.
An important step towards this goal would be the development of algorithms
that can recognize recurrent patterns or structures in large audio
databases without any manual input or labels - i.e. via unsupervised
learning. There are several threads I would like to pursue:
- Unsupervised HMM modeling: HMMs are typically used in
speech recognition to model words in terms of specific, predefined
pronunciations, i.e. 'supervised' training. They can, however, also
be applied without any constraints on the content, based only on
implicit clustering. This project involves modeling speech and
other signals via HMMs without label constraints to see what kind
of patterns will emerge.
- Acoustic event detection and classification: The auditory
system is particularly sensitive to sudden changes - particularly
increases in energy - because these are very often indicative of
significant events in the environment. A starting point for defining
classes of events in audio archives would be to build an 'onset
detector' that lays down a set of landmark times throughout the
archives, then to extract features and perform clustering on these
different discrete events. This project will pursue this idea on
an archive of recorded broadcasts.
- Detection of exact repeats: Broadcasts in particular often
contain segments that are repeated verbatim at different times.
Examples include theme music, news clips and of course commercials.
It should be fairly easy to detect these repetitions, although
the problem can rapidly become intractable for a large archive.
This project will examine several different approaches to detecting
these repeated episodes, as well as looking at the robustness
(misses versus false alarm performance) of the techniques.
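One simple approach to the exact-repeat problem is to hash coarsely quantized windows of feature frames, so that the search stays roughly linear in archive length rather than quadratic (this sketch is one of the candidate approaches, not a settled design; the window width and quantization step are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

def find_repeats(features, width=10, quant=1.0):
    """Find groups of start times whose feature sequences match exactly
    after coarse quantization.

    features: (n_frames, n_dims) array. Each window of `width` frames
    is quantized and hashed; windows sharing a hash are candidate
    verbatim repeats (e.g. jingles, news clips, commercials).
    """
    q = np.round(features / quant).astype(np.int64)
    table = defaultdict(list)
    for t in range(len(q) - width + 1):
        key = q[t : t + width].tobytes()
        table[key].append(t)
    return [times for times in table.values() if len(times) > 1]
```

Coarse quantization trades misses against false alarms: a finer `quant` step rejects near-matches that are not true repeats but may miss repeats that differ slightly in level or alignment, which is exactly the robustness question the project would examine.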
Last updated: 2000/09/11
Dan Ellis <firstname.lastname@example.org>