Project Wish List

Last Update: 03/20/2001 10:46:37 AM
<[email protected]>


Segmentation / Classification of Soccer Field Audio

The whole picture: segmentation as part of soccer video analysis

Classify field audio intuitively (establish ground truth)

Soccer field audio is usually noisy, and is typically a mixture of two or more basic types.

Type No.	1	2	3
Audio	Dominant Speech (Korean)	Crowd Noise	Special Event (whistle)
Description	Mixture of foreground speech and background crowd noise, with visible formant structure.	Complex mixture of vocal and non-vocal noise; more homogeneous than type 1. Crowd noise still varies within itself (e.g. excited vs. unexcited), which can hopefully be discriminated by energy etc.	Distinct acoustic event.
Sound Example
Spectrogram


In principle, it looks similar to speech/music discrimination (bibliography [2] [9]).
More specifically, this is "dominant speech" vs "compound vocal noise" discrimination + alarm sound detection.
Because it deals with noisy mixtures, [10] (?) and [8] (the part about detecting baseball hits) may be more relevant.
Look for suitable features for classification ([11] [2] [9])
Candidates:
    ZCR and its variance, ratio of "low energy frame", entropy, MFCC, spectral flux, the spectral centroid and its variance, cepstral residue and its variance, peak energy ratio (for whistle detection), spectral roll off point...
1.    Need to eliminate features that suit only tonal sounds and not noise.
2.    How can I get some intuition about what each feature measures? (e.g. ZCR roughly indicates the dominant frequency)
3.    Is there a feature extraction package we can download and use? (It is not too hard to implement some of them in MATLAB, but that may be repetitive labor.)
4.    What time scale do we need to look at, and hence what is a suitable segment size?
5.    Is this too trivial to do? Rui et al. [8] just used energy + MFCC to detect endpoints of speech, but I am not sure this will work well in a noisier environment (like soccer or basketball).
Hard points: How to classify speech in a noisy mixture? Is this the so-called "onset and common period detector" in [12]?
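As a rough aid to intuition for two of the candidate features, here is a minimal Python sketch of ZCR and the ratio of low-energy frames. The frame size, hop, the 0.5 energy fraction, the 8 kHz sample rate and the synthetic 440 Hz test tone are all assumptions made for illustration, not values taken from the cited papers. It also shows why ZCR hints at the dominant frequency: a pure tone at f Hz crosses zero about 2f times per second.

```python
import math

def frames(signal, size, hop):
    """Split a list of samples into frames of `size` samples, `hop` apart."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def low_energy_frame_ratio(energies, fraction=0.5):
    """Share of frames whose energy is below `fraction` of the mean energy."""
    mean = sum(energies) / len(energies)
    return sum(1 for e in energies if e < fraction * mean) / len(energies)

def zcr_to_frequency(zcr, sample_rate):
    """A pure tone at f Hz crosses zero 2*f times per second."""
    return zcr * sample_rate / 2.0

# Hypothetical test signals: one second of a 440 Hz tone (small phase offset
# so that no sample is exactly zero), and the same tone followed by silence.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate + 0.1)
        for n in range(sample_rate)]
half_silent = tone[:4000] + [0.0] * 4000

estimated_hz = zcr_to_frequency(zero_crossing_rate(tone), sample_rate)
energies = [short_time_energy(f) for f in frames(half_silent, 400, 400)]
ratio = low_energy_frame_ratio(energies)
print(round(estimated_hz, 1), ratio)
```

For the tone, the frequency estimated from ZCR lands close to 440 Hz. For broadband crowd noise the same number is harder to interpret, which is one reason to check each feature against noisy material before keeping it.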
What's next:
Use prosodic cues in speech segments (like, or improving on, what Rui et al. did at ACM Multimedia 2000: pitch tracking, probabilistic training and modeling, etc.; not easy)
Try classifying excited / unexcited crowd (fairly easy to see in the spectrogram from the intensity, but I have yet to work out how to compute it. Is this what Dongqing is trying to do?)
See if this kind of classification can be extended to other kinds of sports (basketball, baseball, etc)
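One possible starting point for the excited / unexcited crowd idea is to threshold a smoothed short-time energy contour. The sketch below is only an assumption about how this could be computed: the frame size, smoothing width, the 3 dB margin over the median, and the synthetic Gaussian "crowd" signal are all hypothetical choices, not a method from the bibliography.

```python
import math
import random

def frame_log_energy(signal, size, hop):
    """Log energy (dB) of each frame; the small floor avoids log(0)."""
    out = []
    for i in range(0, len(signal) - size + 1, hop):
        frame = signal[i:i + size]
        energy = sum(x * x for x in frame) / size
        out.append(10.0 * math.log10(energy + 1e-12))
    return out

def moving_average(values, width):
    """Centered moving average; the window is clipped at both ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

def label_excited(log_energy, margin_db=3.0):
    """Mark frames whose energy exceeds the median by more than margin_db."""
    ordered = sorted(log_energy)
    median = ordered[len(ordered) // 2]
    return [e > median + margin_db for e in log_energy]

# Hypothetical stand-in for crowd noise: quiet Gaussian noise, then louder.
random.seed(0)
quiet = [random.gauss(0.0, 0.1) for _ in range(16000)]
loud = [random.gauss(0.0, 0.4) for _ in range(8000)]
crowd = quiet + loud

energy_db = moving_average(frame_log_energy(crowd, 400, 400), 5)
labels = label_excited(energy_db)
print(sum(labels), "of", len(labels), "frames labeled excited")
```

A median-based threshold assumes the crowd is unexcited most of the time; on real footage the margin would need tuning, and energy alone may confuse an excited crowd with a loud commentator.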

Bibliography

[1] Albert S. Bregman, "Auditory scene analysis: hearing in complex environments", Thinking in sound: the cognitive psychology of human audition, Oxford University Press, 1993, pp. 10-36
< see reading summary Feb 6 >

[2] John Saunders "Real-time discrimination of broadcast speech/music", ICASSP 96
< see reading summary Feb 8 >

[3] Dellaert, F.; Polzin, T.; Waibel, A. "Recognizing emotion in speech" , ICSLP 1996

[4] Droppo, J.; Acero, A., "Maximum a posteriori pitch tracking", ICSLP 1998
< see reading summary Feb 26 >

[5] Arons, Barry, "SpeechSkimmer: A System for Interactively Skimming Recorded Speech", ACM CHI 1997

[6] Chao Wang; Seneff, S., "Robust pitch tracking for prosodic modeling in telephone speech", ICASSP 2000

[7]    Jonathan Foote, "Visualizing music and audio using self-similarity", Proc. ACM Multimedia 1999
        (This paper may seem irrelevant to the topic, yet it's still interesting to read)
        < see reading summary Feb 26    >

[8] Yong Rui, Anoop Gupta, Alex Acero, "Automatically extracting highlights from TV baseball programs", Proc. ACM Multimedia 2000

[9] Eric Scheirer, Malcolm Slaney, "Construction and evaluation of a robust multi-feature speech/music discriminator", ICASSP 97
< see reading summary week4 >

[10] Ellis, D., & Williams, G., "Speech/music discrimination based on posterior probability features", Proc. Eurospeech-99, Budapest

[11] Carey, M.J.; Parris, E.S.; Lloyd-Thomas, H. "A comparison of features for speech, music discrimination", ICASSP-99

[12] Ellis, D., "Hard problems in computational auditory scene analysis", http://sound.media.mit.edu/~dpwe/writing/hard-probs.html

1. Video syntax analysis via audio cues

Type 1    audio-visual information centric, e.g. movie
                try to segment consistent chunks of audio data (speech/music) to form a complete video skim
                specific points of interest: silence detection, music/speech classification

Type 2    video info centric, e.g. soccer video
                3 kinds of audio: acclamation, whistle, commentary
                the presence of the first 2 kinds of events is usually a clue to important happenings or transition points in the game;
                changes in the commentary (pitch and speed changes, narrator stops) are also useful.
                Problem 1: how to segment these 3 types of sound?
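For the whistle, one candidate cue is how strongly a frame's energy concentrates around a single spectral peak. The Python sketch below uses an assumed definition of "peak energy ratio" (energy within a few bins of the strongest bin, divided by total energy); the exact definition in the literature may differ. The 3 kHz tone, 8 kHz sample rate, 256-sample frame and noise level are hypothetical test values, and the naive DFT is only meant for short frames.

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Naive DFT power spectrum (bins 0..N/2) of a Hann-windowed frame."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def peak_energy_ratio(frame, halfwidth=2):
    """Energy within `halfwidth` bins of the strongest bin over total energy."""
    power = power_spectrum(frame)
    peak = max(range(len(power)), key=power.__getitem__)
    lo, hi = max(0, peak - halfwidth), min(len(power), peak + halfwidth + 1)
    total = sum(power)
    return sum(power[lo:hi]) / total if total > 0 else 0.0

# Hypothetical frames: crowd-like noise alone, and a 3 kHz tone in noise.
random.seed(1)
sample_rate, size = 8000, 256
noise_frame = [random.gauss(0.0, 0.3) for _ in range(size)]
whistle_frame = [math.sin(2 * math.pi * 3000 * i / sample_rate)
                 + random.gauss(0.0, 0.3) for i in range(size)]

noise_ratio = peak_energy_ratio(noise_frame)
whistle_ratio = peak_energy_ratio(whistle_frame)
print(round(noise_ratio, 2), round(whistle_ratio, 2))
```

On these synthetic frames the tone-in-noise ratio is much higher than the noise-only ratio, so a simple threshold separates them. Voiced commentary also has strong harmonic peaks, so on real audio this cue would likely need to be combined with other features.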

2. Music watermarking

3. Constrained music analysis and synthesis