"Deformable Spectrograms"
Manuel Reyes-Gomez           Nebojsa Jojic           Daniel P.W. Ellis
Columbia University          Microsoft Research      Columbia University

VI.- RESULTS ON SPECIFIC APPLICATIONS.
FEATURES FOR SPEECH RECOGNITION.

The phonetic distinctions at the basis of speech recognition reflect vocal tract filtering of glottal
excitation. In particular, the dynamics of formants (vocal tract resonances) are known to be
powerful "information-bearing elements" in speech. We believe the formant transformation
maps may be a robust discriminative feature to be used in conjunction with traditional features in
speech recognition systems, particularly in noisy conditions, given that the belief propagation
algorithm "enforces" a dynamic structure on the transformation maps. Since the speech signal is
often more structured than the noise, the belief propagation algorithm finds
transformation maps that are consistent with the dynamics of the speech rather than those of
the noise. An example of the robustness of the formant transformation maps to noise can be
observed in figure 14, where the formant transformation maps for clean and noisy versions
of the same signal are shown.

RESULTS ON A SPEECH RECOGNITION TASK.
We computed two sets of transformation maps: one using formants obtained with our
model, and another using formants obtained by cepstral smoothing. The latter requires
only a single-layer model.
We then used features derived from these maps in combination with standard features
in a speech recognizer to test whether the maps contribute new information not captured
by the regular features.
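
As an illustration of the cepstral smoothing step mentioned above, the following is a minimal sketch of the standard technique: the log-magnitude spectrum of a frame is transformed to the cepstral domain, only the low-quefrency coefficients are kept, and the result is transformed back to give a smooth spectral envelope in which formant peaks stand out. The function name `cepstral_smooth` and the parameters `n_fft` and `n_keep` are assumptions made for this example; the settings actually used in the experiments are not stated here.

```python
import numpy as np

def cepstral_smooth(frame, n_fft=512, n_keep=30):
    """Return a smoothed log-magnitude spectrum (spectral envelope) of one frame.

    n_fft and n_keep are illustrative values, not settings from the paper.
    """
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed, n=n_fft)) + 1e-10)

    # Real cepstrum: inverse transform of the log-magnitude spectrum.
    cepstrum = np.fft.irfft(log_mag, n=n_fft)

    # Low-quefrency lifter: keep the first n_keep coefficients and their
    # mirror images so that the liftered cepstrum stays symmetric.
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0

    # Back to the frequency domain; the result is real for a symmetric input.
    return np.fft.rfft(cepstrum * lifter, n=n_fft).real
```

A smaller `n_keep` gives a smoother envelope; setting it to `n_fft // 2 + 1` keeps every coefficient and returns the unsmoothed log spectrum.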

To convert the formant transformation maps into features suitable for the recognizer,
we applied mel-scale filtering and a discrete cosine transform to decorrelate and
reduce the dimensionality of the final feature vectors.
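
The conversion described above can be sketched as follows. This is an illustrative example only: it assumes a transformation map is stored as a 2-D array with one row per frequency bin and one column per time frame, and the function names, the number of filters (20), the number of retained coefficients (12), and the 8 kHz sampling rate are all assumptions chosen for the example rather than values taken from the experiments. No logarithm is applied because the map entries are not assumed to be positive.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_bins)."""
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def dct_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] *= np.sqrt(1.0 / n_in)
    basis[1:] *= np.sqrt(2.0 / n_in)
    return basis

def map_to_features(transform_map, sample_rate=8000, n_filters=20, n_coeffs=12):
    """Mel-pool a (n_bins, n_frames) map along frequency, then apply a truncated DCT."""
    n_bins = transform_map.shape[0]
    pooled = mel_filterbank(n_filters, n_bins, sample_rate) @ transform_map
    return dct_matrix(n_coeffs, n_filters) @ pooled
```

For a map with 129 frequency bins and 50 frames, `map_to_features` returns a 12 x 50 array, that is, one 12-dimensional feature vector per frame that could be appended to the standard features.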

We used the Aurora-2 noisy digits database for our experiments.
Word error rates (in percent) at different SNR levels are shown in the following table:
Features            CLEAN   SNR20   SNR15   SNR10   SNR5
PLP12+delta          0.94     2.3     4.1     7.9   12.2
PLP12+delta+FTM1     0.98     2.3     3.4     6.8   11.1
PLP12+delta+FTM2     1.3      2.5     4.2     9.7   12.5

Features derived from formant transformation maps obtained using one and two layers are
referred to as "FTM1" and "FTM2", respectively.

Using Perceptual Linear Prediction (PLP) features combined with FTM1 features, the
recognizer performance remains about the same as with the standard features alone when
the signal has a high SNR, but as the SNR decreases the new features improve
the word error rate (WER) by as much as 19.5% relative for the 15 dB SNR ("SNR15")
condition. We believe that when the signals are relatively clean, a local analysis of the
energy dynamics is sufficient to effectively disambiguate the words. However, as the
interference becomes larger, a more global model of the energy dynamics, such as the
formant transformation maps, can reduce the influence of local energy variations due to the
noise.
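
For reference, a relative WER improvement is the reduction in error rate expressed as a fraction of the baseline error rate. The short sketch below applies that definition to the PLP12+delta and PLP12+delta+FTM1 rows of the table; the function and variable names are made up for this example. With the tabulated values the SNR15 condition evaluates to roughly 17%, somewhat below the 19.5% quoted in the text, a difference that may come from rounding or transcription of the table entries.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in word error rate, in percent of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Values copied from the results table (WER per condition).
conditions = ["CLEAN", "SNR20", "SNR15", "SNR10", "SNR5"]
plp = [0.94, 2.3, 4.1, 7.9, 12.2]
plp_ftm1 = [0.98, 2.3, 3.4, 6.8, 11.1]

reductions = {
    name: relative_wer_reduction(base, new)
    for name, base, new in zip(conditions, plp, plp_ftm1)
}

for name, value in reductions.items():
    print(f"{name}: {value:+.1f}% relative")
```

A negative value means the combined features did slightly worse than the baseline for that condition, as in the clean case.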

Using FTM2 features does not improve the performance of the recognizer. This may be
because the layers cannot be separated when the two layers have parallel dynamics.
However, independent modeling of transformation maps for both layers is important for
other applications such as source separation.
