Meeting Recorder

This page contains sound examples recorded from multiple channels at the same time. This is an interesting case because it sometimes allows us to distinguish between different sound sources on the basis of the different arrival times and amplitude levels at each sensor.

These particular sound examples are derived from the ICSI Meeting Recorder project. This is an ongoing project to see how speech processing technology can help with managing recordings of conventional meetings.

The data here were collected from tabletop microphones during a meeting with six participants. This excerpt lasts 5 minutes (300 seconds) and begins 17 minutes into the recording. It was chosen because it contains a lot of overlap between the different speakers.

Each soundfile is a stereo WAV file at a 16 kHz sampling rate, so each file has 16000x300 = 4,800,000 stereo frames. You can read the data into Matlab with e.g.

[d,r] = wavread('pzm12.wav',[(60*16000)+1 (70*16000)]);

to read just 10 seconds of data from 1 minute into the excerpt. The data stored in d will have 160,000 rows and two columns, with each column being one of the stereo channels.
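In recent Matlab releases wavread is no longer available, and audioread is its replacement; it accepts the same [first last] sample range. The following is a minimal sketch of the equivalent read, assuming pzm12.wav is in the current directory. The variable names fs, t0, t1 and range are illustrative only.

```matlab
% Read 10 seconds of pzm12.wav, starting 1 minute into the excerpt.
fs = 16000;                     % sampling rate stated above (Hz)
t0 = 60;                        % start time (seconds)
t1 = 70;                        % end time (seconds)
range = [t0*fs + 1, t1*fs];     % 1-based, inclusive sample range

% Only attempt the read if the file is actually present.
if exist('pzm12.wav', 'file')
    [d, r] = audioread('pzm12.wav', range);
    disp(size(d));              % rows = samples read, columns = channels
end
```

Changing t0 and t1 selects any other stretch of the 300-second excerpt, as long as t1*fs does not exceed 4,800,000.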

There are three soundfiles:

pdaLR.wav contains the two channels recorded from a dummy PDA on the tabletop. The mics are about 3 inches apart. The setup is rather low-quality, with a lot of low-frequency background noise.
pzm12.wav contains the first two of the PZM microphones, which were spaced along the center of the table. These two were in the middle of the participants and have the strongest signals. They were about 3 feet apart.
pzm34.wav contains the remaining two PZM microphone signals. These mics were further down the table, away from the participants, and have weaker signals. They were also about 3 feet apart, and 3 feet away from the first PZM mics.
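To get a feel for the timing differences these spacings can produce, the sketch below works out the largest possible inter-microphone delay in samples for each pair. It assumes a speed of sound of 343 m/s (a typical room-temperature value, not taken from the recordings) and a source lying on the line through both mics; the variable names are illustrative.

```matlab
fs = 16000;                          % sampling rate (Hz)
c = 343;                             % ASSUMED speed of sound in air (m/s)
spacing_m = [3*0.0254, 3*0.3048];    % 3 inches (PDA mics), 3 feet (PZM mics)

% Largest delay occurs when the source is in line with the two mics.
max_delay_s = spacing_m / c;
max_delay_samples = max_delay_s * fs;

fprintf('PDA pair: up to %.2f samples\n', max_delay_samples(1));
fprintf('PZM pair: up to %.2f samples\n', max_delay_samples(2));
```

Under these assumptions the PDA pair can differ by only a few samples (about 3.6), whereas a PZM pair can differ by roughly 43 samples, so timing cues are much coarser for the closely spaced mics.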

All channels were recorded sample-synchronously. However, because of limitations in the recording software, there may be a fixed skew between the channels; normally this is a multiple of 21⅓ samples, and might be a 64-sample delay (3 × 21⅓) on the first channel of each stereo pair (with something larger than this between the different files, although still a multiple of the basic skew).

The following versions have been processed to remove this effect, so the channels should all be exactly synchronized (I believe). They have also been high-pass filtered to remove the sub-10 Hz air-conditioning noise, which actually dominates the energy of the raw signals:

pzm12a.wav is the processed version of pzm12.
pzm34a.wav corresponds to pzm34.
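One way a reader might check for a skew between two channels is to find the lag that maximizes their cross-correlation. The sketch below is not the procedure used to produce the processed files; it only illustrates the idea on a synthetic chirp in which the second channel is delayed by 64 samples. It uses a plain loop so that no toolbox is needed, and all names (x, y, true_lag, max_lag) are hypothetical. On real data, x and y would be the two columns of d, and the peak lag would also include the acoustic delay between the mics, not just the fixed skew.

```matlab
% Synthetic test signal: a linear chirp, which has a sharp autocorrelation peak.
N = 4000;
n = (0:N-1)';
x = sin(2*pi*(0.01*n + 0.00005*n.^2));

% Second channel: the same signal delayed by a known number of samples.
true_lag = 64;
y = [zeros(true_lag, 1); x(1:end-true_lag)];

% Cross-correlate over a limited range of lags.
max_lag = 100;
lags = -max_lag:max_lag;
xc = zeros(size(lags));
for k = 1:numel(lags)
    L = lags(k);
    if L >= 0
        xc(k) = sum(x(1:end-L) .* y(1+L:end));   % y delayed relative to x
    else
        xc(k) = sum(x(1-L:end) .* y(1:end+L));   % y advanced relative to x
    end
end

% The lag with the largest correlation is the estimated delay of y after x.
[~, imax] = max(xc);
est_lag = lags(imax);
fprintf('Estimated lag: %d samples\n', est_lag);
```

A positive est_lag means the second channel lags the first; shifting it back by that many samples would line the two up.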

The file transcript.txt contains a human-generated transcript of the speech in the meeting. Each line has the form:

start duration channel words

where each field is separated by a tab character. start and duration are in seconds, relative to the start of the soundfile; channel is 0 to 5 for the 6 different speakers; words is the transcript of what was said. You can read the file into Matlab with the following command:

[start,duration,channel,words] = textread('transcript.txt','%f%f%f%s','whitespace','\t');

which reads each line as three numbers and a string, with TAB as the only field separator (so the words are not broken up into separate fields). Each of the returned variables start, duration, channel and words is a column with one entry per line in the file; the first three are numeric vectors, and words is a cell array of strings.
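Once start, duration and channel are loaded, they can be used to measure how much of the excerpt has more than one speaker talking. The sketch below marks each speaker's activity on a 10 ms grid and counts the frames in which two or more speakers are active. The four segments are hypothetical stand-in values, not taken from transcript.txt; with the real file you would use the textread output and set total_s to 300.

```matlab
% HYPOTHETICAL stand-in segments (replace with the textread output).
start    = [0.0; 1.5; 2.0; 6.0];    % seconds
duration = [2.0; 1.0; 3.0; 1.0];    % seconds
channel  = [0; 1; 2; 0];            % speaker index, 0 to 5

step = 0.01;                        % grid resolution (seconds)
total_s = 10;                       % length of the stand-in excerpt (300 for the real one)
nchan = 6;                          % six speakers, channels 0 to 5
nbins = round(total_s / step);

% active(t, s) is true when speaker s-1 is talking during frame t.
active = false(nbins, nchan);
for i = 1:numel(start)
    first = round(start(i) / step) + 1;
    last = min(nbins, round((start(i) + duration(i)) / step));
    active(first:last, channel(i) + 1) = true;
end

nspk = sum(active, 2);                 % number of active speakers per frame
speech_s = sum(nspk >= 1) * step;      % time with at least one speaker
overlap_s = sum(nspk >= 2) * step;     % time with two or more speakers
fprintf('Overlap: %.2f s of %.2f s of speech\n', overlap_s, speech_s);
```

The same active matrix can be indexed by frame to pick out regions with exactly one speaker, which may be useful as reference segments when comparing the microphone channels.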

(The original transcription included annotation of various nonspeech sounds such as inbreaths by particular speakers, or background sounds, which had a channel of "default". The version transcript-all.txt includes all these extra events.)

Headset channels

For comparison, here are the headset (close-talking) mic channels for the 5 participants over the same 5-minute excerpt. This is as close as we can get to the "pure" source signals.

Last updated: $Date: 2005/05/04 04:11:32 $

Dan Ellis