EE6820 Project: Segmentation / Classification of Soccer Field Audio (1)

Dataset and General Observations

1. Dataset Description

Soccer: MPEG-1 system stream, 1.5~1.6 Mb/sec, CBR
Video -- 352*240, 30 frames/second
Audio -- sampling rate 44100 Hz, stereo (down-sampled to 16Kb/s mono before processing)

No.	Name	Source	Total Length	Language	Content description	Sound Characteristics
1	Costa	FOX Sports	01:29:00	English	Last 75 minutes of 2002 U-Champion Latin America Qualify, Costa Rica vs Guatemala, 8:1	Clear Commentary, weak crowd noise
2	Argentina	FOX Sports	01:29:40	English	Last 75 minutes of Football Argentino, Los Andes vs. River Plate, 2:0	Clear commentary, moderate crowd noise, band-limited to ~5.5 kHz (from original production or broadcast settings)
3	News2	MPEG-7	00:15:00	Spanish	Part of a news program	Clear commentary, strong crowd noise
4	Korea	MPEG-7	00:54:53	Korean	Not understood (plays are usually short; the teams seem pretty rusty, though)	Commentary: heavy utterance; Crowd: very noisy and excited, with drums and shouting

2. Observations

An informal look at the waveforms and spectrograms below gives some hints about useful features. Things worth trying out:
(a)    Time domain
        Amplitude information --- total energy in a short time-window; mean and variance of amplitude;
        Zero-crossing rate ---- and its 1st~3rd order moments.
        ... ...
(b)    Frequency domain
        Subband energy in spectrum (03/31/01)
        Subband energy distribution along frequency axis (frequency "discreteness" for distinguishing formant structure)
(c)    Cepstrum and more complicated features
        MFCC, features incorporating auditory model, ...
(d)    Posteriors coming out of a speech recognizer
        How would this perform in noisy environments of different levels, and how would it perform with an unknown language ...
(e)    Try to find formant structure using pitch tracking.
        Useful if excited/unexcited commentator classification is desired. Overall very complicated; reported accuracy is ~75% (49 out of 66).
        This may be further confused by pseudo-formant structure in crowd noise (see Figure 2).
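
The time-domain and subband features in (a) and (b) can be sketched roughly as follows. This is only an illustrative sketch, not the project's actual code: the function names, the 256-sample frame, the 4 equal-width bands, and the 16 kHz sample rate in the demo are all assumptions, and the "1st~3rd order moments" are read here as the mean and the 2nd and 3rd central moments. A naive DFT keeps the example dependency-free; an FFT would be the practical choice on real audio.

```python
import cmath
import math


def split_frames(signal, frame_len, hop):
    """Split a list of samples into fixed-length, possibly overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Total energy in one short time window (sum of squared amplitudes)."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def central_moments(values):
    """Mean, 2nd central moment (variance) and 3rd central moment."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return mu, m2, m3


def subband_energies(frame, n_bands):
    """Energy in n_bands equal-width bands between 0 Hz and Nyquist.

    Uses a naive O(N^2) DFT. Bins left over when the half-spectrum does
    not divide evenly into n_bands are ignored.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


if __name__ == "__main__":
    sample_rate = 16000  # assumed sample rate, for illustration only
    tone = [math.sin(2 * math.pi * 1000 * t / sample_rate + 0.1)
            for t in range(1024)]
    frame_list = split_frames(tone, 256, 128)
    zcrs = [zero_crossing_rate(f) for f in frame_list]
    print("energy of first frame:", short_time_energy(frame_list[0]))
    print("ZCR moments:", central_moments(zcrs))
    print("subband energies:", subband_energies(frame_list[0], 4))
```

Per-frame values such as the energy, the ZCR moments, and the normalized subband energies could then be stacked into a feature vector for each segment to be classified.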

Figure 1. Waveform and Spectrogram of Different Soccer Field Audio
The darker the spectrogram, the larger the amplitude.

[Images: waveform and spectrogram panels for Costa, Argentina, News2, and Korea]

Figure 2. Formant-like peaks in crowd noise
(from Wavesurfer screen dump)
Both segments have speech at the beginning and crowd noise later on.
After all, crowd noise mainly consists of many human voices; will this confuse a pitch tracker?
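
One way to probe this question would be to run a simple pitch estimator on both kinds of segments and compare how strong the periodicity peak is. The sketch below is a basic autocorrelation estimator, not the tracker used in Wavesurfer; the function name, the 80-400 Hz search range, and the 16 kHz sample rate in the usage note are assumptions made for illustration.

```python
def autocorr_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate pitch from the peak of the normalized autocorrelation.

    Returns (pitch_hz, strength). strength is r[lag] / r[0]: close to 1
    for a strongly periodic frame, much lower for a noise-like frame.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0.0  # silent frame: no pitch

    # Search only lags that correspond to the assumed pitch range.
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)

    best_lag, best_r = 0, -1.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag, best_r
```

For example, `autocorr_pitch(frame, 16000)` on a 50 ms frame (800 samples) returns a pitch candidate and its strength. Comparing the strength values on the speech and crowd portions of the segments in Figure 2 would give a first indication of whether the formant-like peaks in crowd noise produce spurious pitch candidates.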

[Images: Wavesurfer panels for News2 and Costa]

Last Update: 04/01/2001 03:52:37 PM