EE6820 Project:
Segmentation / Classification of Soccer Field
Audio (2)
1. Speech signal:
The lowest formant of human speech usually lies below 1 kHz (A2 = 880 Hz;
it is reasonable to assume no speaker goes higher than that), so we compute the
energy in this band and normalize it by its maximum over the whole analysis time
range (assuming the recording gain is constant within one game). Clustered peaks
in the low-band energy indicate speech.
    % spectrogram
    [b, f, t] = specgram(d,128,fs);
    energy = abs(b).^2;
    % divide into 8 energy bands
    for i=1:8
        k = (i-1)*floor(size(b,1)/8)+1;
        e(i,:) = sum(energy(k:k+8,:));
    end
    % low-band energy
    figure(1); subplot(312);
    plot(t, e(1,:)/max(e(1,:)),'b');
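To see which frequencies each of the 8 bands covers, the band edges can be worked out from the FFT length. The following is a minimal sketch, not part of energy.m; the sampling rate `fs = 16000` is an assumption (the page does not state it), and the variable names `nrows`, `w`, and `edges` are hypothetical.

```matlab
% Sketch: frequency range summed into each of the 8 energy bands.
% fs = 16000 is an ASSUMED sampling rate; the source does not give it.
fs    = 16000;                      % assumed, Hz
nfft  = 128;                        % FFT length passed to specgram
nrows = nfft/2 + 1;                 % rows of b for a real signal (65)
f     = (0:nrows-1)' * fs / nfft;   % frequency of each spectrogram row, Hz
w     = floor(nrows/8);             % band step, same as in the loop (8 rows)

edges = zeros(8, 2);                % [lowest Hz, highest Hz] per band
for i = 1:8
    k = (i-1)*w + 1;                % first row of band i
    edges(i,:) = [f(k), f(k+8)];    % band i sums rows k..k+8
end
disp(edges);
```

Under this assumed rate, band 1 spans 0 to 1000 Hz and band 8 spans 7000 to 8000 Hz; in general band i spans (i-1)*fs/16 to i*fs/16. Each band sums 9 rows, so neighboring bands share one boundary row.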
Sometimes fricatives occur when vowels are absent; they show up as noise-like high-band peaks in the spectrogram. As before, we compute the energy in this band and normalize it by its maximum over the whole recording. This is particularly useful when the crowd noise level is not suppressed in the low band.
    % high-band energy
    figure(1); subplot(313);
    plot(t, e(8,:)/max(e(8,:)),'g');
(from energy.m)
2. Crowd noise:
Seems to be more clustered in the mid-band, even though it
is the aggregation of many human vocal sounds.
Is this due to recording conditions, channel conditions, etc.?
3. Preliminary Result:
All four segments contain dominant speech
and crowd noise; the two are easy to tell apart simply by looking at
the plots.
The low-band peaks align with the voiced parts of the speech
signal (vowels), and the high-band peaks align with fricatives.
The high-band energy in the band-limited
audio track (Argentina) carries no information, because the fricatives are lost
(although human listeners can still make out the speech). This noise-like high-band energy
signal will not affect our judgment if the next step of data processing is carefully
conducted.
Figure 1. Spectrogram and
low-band / high-band energy contour of different soccer field audio.
The darker the spectrogram, the larger the amplitude.
| Costa | Argentina |
| News2 | Korea |
4. Next
(a) Process the
energy contour and automatically output segmentation boundaries.
Something similar to what John Saunders[ref2]
did with his ZCR contour?
Determine the proper time resolution for analysis.
Too
small ---- pauses in the commentator's speech are misclassified as crowd noise, which may have
a negative effect on excited/unexcited crowd classification.
Too
large ---- short segments of excited crowd noise are more likely to be missed if
they occur at almost the same time as excited speech.
(b) Combine the low-band and high-band
contours (make judgments like "if fricatives are more likely to occur
between vowels, then this segment is speech").
(c) Is this enough? Is low-band
crowd noise always suppressed during recording?
(d) As far as computation is concerned, time-domain
processing seems more efficient.
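Steps (a) and (b) above could be prototyped roughly as follows. This is only a sketch under stated assumptions, not the project's method: the contours `low` and `high` are synthetic, hand-made stand-ins for `e(1,:)` and `e(8,:)`, the per-frame maximum is just one possible way to combine them, and the window length `win` and threshold `thr` are hypothetical tuning values.

```matlab
% Synthetic stand-ins for the low-band and high-band energy contours.
nfr  = 100;                        % number of analysis frames
low  = 0.05 * ones(1, nfr);        % background level
low(21:50) = 0.8;                  % vowels
low(35:36) = 0.05;                 % short pause inside the speech
low(76:90) = 0.7;                  % second speech burst
high = 0.05 * ones(1, nfr);
high(51:55) = 0.9;                 % fricative right after the vowels

win = 5;                           % smoothing window in frames (assumed)
thr = 0.5;                         % decision threshold (assumed)

% (b) combine the two normalized contours frame by frame
combined = max(low / max(low), high / max(high));

% (a) smooth, threshold, and read boundaries off the 0/1 transitions
smoothed = conv(combined, ones(1, win) / win, 'same');
mask     = smoothed > thr;         % 1 = speech-like frame
d        = diff([0, mask, 0]);     % +1 at segment start, -1 after segment end
segments = [find(d == 1)', find(d == -1)' - 1];   % [first frame, last frame]
disp(segments);
```

For this synthetic input the sketch yields two segments, frames 21 to 55 and frames 76 to 90. The two-frame pause at frames 35 and 36 is bridged by the 5-frame window, while thresholding `combined` directly would split the first segment there; this is the time-resolution trade-off described in (a).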
Last Update: 04/01/2001 03:50:33 PM