1. Speech signal:
The lowest formant of human speech is usually below 1 kHz (for comparison, the note A at 880 Hz, A5 in scientific pitch notation, is already higher than anyone can reasonably be expected to speak), so compute the energy in this band and normalize it by its maximum over the whole analysis time range (assuming the recording gain is constant within one game). Clustered peaks in the low-band energy indicate speech.

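% spectrogram (128-point FFT; d is the audio signal, fs its sampling rate)
[b, f, t] = specgram(d,128,fs);
energy = abs(b).^2;
% divide into 8 non-overlapping energy bands of w bins each
% (the original k:k+8 shared one bin between neighbouring bands;
%  the leftover Nyquist bin is not used)
w = floor(size(b,1)/8);
e = zeros(8, size(b,2));
for i=1:8
k = (i-1)*w+1;
e(i,:)= sum(energy(k:k+w-1,:), 1);
end
% low-band energy, normalized to its maximum
figure(1); subplot(312);
plot(t, e(1,:)/max(e(1,:)),'b');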

Sometimes fricatives occur where vowels are absent; they show up as noise-like high-band peaks in the spectrogram. As with the low band, we compute the energy in the high band and normalize it by its maximum over the whole analysis time range. This is particularly useful when the crowd noise is not suppressed in the low band.
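
The high-band contour can be sketched in the same way as the low-band one. The snippet below is only an illustration under stated assumptions: the signal is synthetic (weak noise plus a high-pass noise burst standing in for a fricative), fs = 8000 Hz is an arbitrary choice, the short-time spectrum is computed with plain fft instead of specgram so that no toolbox is needed, and taking band 8 as "the high band" is a guess at what the plots use.

```matlab
% Synthetic stand-in for the audio (assumption, not the real recordings):
% 1 s of weak noise with a high-frequency noise burst in the middle.
fs = 8000;                               % assumed sampling rate
d  = 0.01*randn(fs,1);
burst = randn(2000,1);
burst = burst - [0; burst(1:end-1)];     % first difference = crude high-pass
d(3001:5000) = d(3001:5000) + burst;     % "fricative" from 0.375 s to 0.625 s

% Short-time power spectrum: 128-point frames, 50% overlap, Hann window
nfft = 128; hop = 64;
win  = 0.5 - 0.5*cos(2*pi*(0:nfft-1)'/nfft);
nfrm = floor((length(d)-nfft)/hop) + 1;
energy = zeros(nfft/2, nfrm);
for m = 1:nfrm
    idx = (m-1)*hop + (1:nfft)';         % sample indices of frame m
    X   = fft(d(idx) .* win);
    energy(:,m) = abs(X(1:nfft/2)).^2;   % keep the 64 bins below Nyquist
end
t = ((0:nfrm-1)*hop + nfft/2)/fs;        % frame centres in seconds

% Eight equal-width bands of w bins each; band 8 is taken as the high band
w = size(energy,1)/8;
e = zeros(8, nfrm);
for i = 1:8
    e(i,:) = sum(energy((i-1)*w + (1:w), :), 1);
end
hi = e(8,:)/max(e(8,:));                 % normalize by the overall maximum
```

Plotting hi against t next to the low-band contour should show the burst as an isolated high-band peak. With the real recordings, d and fs would come from the audio file instead.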

2.    Crowd noise:
        Seems to be clustered more in the mid-band, even though it is the aggregate of many human voices.
        Is this due to the recording conditions, channel conditions, etc.?

3.    Preliminary Result :
       All four segments contain dominant speech and crowd noise, and it is easy to distinguish between the two simply by looking at the plots.
       The low-band peaks align with the vowels of the speech signal, and the high-band peaks align with the fricatives.
       The high-band energy in the band-limited audio track (Argentina) doesn't mean anything, because the fricatives are lost (although human listeners can still make out the speech). This noise-like high-band energy won't affect our judgment if the next step of data processing is carefully conducted.

Figure 1. Spectrogram and low-band / high-band energy contour of different soccer field audio
The darker the spectrogram, the larger the amplitude.

[Figure panels: Costa, Argentina, News2, Korea]

(a)        Process the energy contour and automatically output the segmentation boundaries.
             Something similar to what John Saunders[ref2] did with his ZCR (zero-crossing rate) contour?
             Determine the proper time resolution for analysis.
             Too small: pauses in the commentator's speech may be misclassified as crowd noise, which may hurt the excited/unexcited crowd classification.
             Too large: short segments of excited crowd noise are more likely to be missed if they occur at almost the same time as excited speech.
(b)        Combine the low-band and high-band contours (make judgments like "if fricatives are more likely to occur between vowels, then this segment is speech").
(c)        Is this enough? Is low-band crowd noise always suppressed during recording?
(d)        As far as computation is concerned, time-domain processing seems more efficient.
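
As a rough illustration of items (a) and (b), the sketch below combines two normalized contours, smooths them over a chosen time resolution, and thresholds the result to get segment boundaries. Everything in it is an assumption made for the example: the contours lo and hi are synthetic, and the frame spacing dt, the resolution res, the threshold thr, and the max-combination rule are placeholder choices, not values or methods from the experiments above.

```matlab
% Hypothetical normalized contours on a 10 ms frame grid (stand-ins for
% the low-band and high-band energy curves).
dt = 0.01;                          % frame spacing in seconds (assumed)
t  = (0:499)*dt;
lo = 0.05*ones(size(t));
hi = 0.05*ones(size(t));
lo(t >= 1.0 & t < 2.0) = 0.8;       % "vowels" of a first utterance
fric = (t >= 1.4 & t < 1.6);        % a "fricative" between the vowels
hi(fric) = 0.6;
lo(fric) = 0.05;
lo(t >= 3.0 & t < 4.0) = 0.9;       % a second utterance

% (b) combine the contours: a frame counts if either band is active
c = max(lo, hi);

% (a) smooth with a moving average whose length sets the time resolution
res = 0.2;                          % analysis resolution in seconds (assumed)
n   = round(res/dt);
c   = conv(c, ones(1,n)/n, 'same');

% Threshold, then read segment boundaries off the rising and falling edges
thr = 0.3;                          % assumed decision threshold
s   = c > thr;
ds  = diff([0 s 0]);
on  = find(ds == 1);                % first frame of each segment
off = find(ds == -1) - 1;           % last frame of each segment
segs = [t(on)' t(off)'];            % one row per segment: [start end] in s
```

With these placeholder values the fricative gap is bridged and two segments come out, near 1 to 2 s and 3 to 4 s. Changing res is a quick way to explore the trade-off described in (a): a longer window merges or hides short events, while a shorter one reacts to brief pauses.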

Low-band and High-band Energy