VI.- RESULTS ON SPECIFIC APPLICATIONS.

FEATURES FOR SPEECH RECOGNITION.

The phonetic distinctions at the basis of speech recognition reflect vocal tract filtering of glottal excitation. In particular, the dynamics of formants (vocal tract resonances) are known to be powerful "information-bearing elements" in speech. We believe the formant transformation maps may be a robust discriminative feature to be used in conjunction with traditional features in speech recognition systems, particularly in noisy conditions, given that the belief propagation algorithm "enforces" a dynamic structure on the transformation maps. Since the speech signal is often more structured than the noise, the belief propagation algorithm finds transformation maps that are consistent with the dynamics of the speech rather than those of the noise. An example of the robustness of the formant transformation maps to noise can be observed in figure 14, where the formant transformation maps for clean and noisy versions of the same signal are shown.

We computed two sets of transformation maps: one using formants obtained with our model, and another using formants obtained with cepstral smoothing. For the latter we only require a single-layer model. We then used features derived from these maps in combination with standard features in a speech recognizer to test whether the maps contribute new information not captured by the regular features.

RESULTS ON A SPEECH RECOGNITION TASK.

To convert the formant transformation maps into features suitable for the recognizer, we applied mel-scale filtering and a discrete cosine transform to decorrelate and reduce the dimensionality of the final feature vectors. We used the Aurora-2 noisy digits database for our experiments. Results at different SNR levels are shown in the following table. Features derived from formant transformation maps obtained using one and two layers are referred to as ``FTM1'' and ``FTM2'', respectively.

Using Perceptual Linear Prediction (PLP) features combined with FTM1 features, the recognizer performance remains about the same as with the standard features alone when the signal has a high SNR, but as the SNR decreases the new features improve the word error rate (WER) by as much as 19.5\% relative for the 15~dB SNR (``SNR15'') condition. We believe that when the signals are relatively clean, a local analysis of the energy dynamics is sufficient to effectively disambiguate the words. However, as the interference becomes larger, a more global model of the energy dynamics, such as the formant transformation maps, can reduce the influence of local energy variations due to the noise.

Using FTM2 features does not improve the performance of the recognizer. This may be because the layers cannot be separated when the two layers have parallel dynamics. However, independent modeling of transformation maps for both layers is important for other applications such as source separation.
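
As a rough illustration of the feature conversion step used in the recognition experiments (mel-scale filtering followed by a discrete cosine transform), the following Python sketch processes a single frame of a transformation map. It is not the implementation used for the reported results. It assumes, hypothetically, that each frame is a vector of values on linearly spaced frequency bins from 0 Hz to the Nyquist frequency; the function names, the triangular filter shape, the number of filters, the number of retained coefficients, and the absence of a logarithm between the two steps are all illustrative choices that the text does not specify.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale formula (an assumed choice, not given in the text)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters over n_bins linear bins spanning [0, sample_rate / 2].

    Requires n_bins >= 2. Returns a list of n_filters rows of n_bins weights.
    """
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct_ii(x, n_coeffs):
    """First n_coeffs coefficients of the orthonormal DCT-II of x."""
    n = len(x)
    out = []
    for k in range(n_coeffs):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def map_frame_to_features(frame, bank, n_coeffs):
    """Mel-filter one map frame, then keep the leading DCT coefficients."""
    mel_values = [sum(w * v for w, v in zip(row, frame)) for row in bank]
    return dct_ii(mel_values, n_coeffs)


# Hypothetical sizes: 65 frequency bins, 8 kHz sampling, 5 filters, 3 coefficients.
bank = mel_filterbank(n_filters=5, n_bins=65, sample_rate=8000)
features = map_frame_to_features([1.0] * 65, bank, n_coeffs=3)
```

Keeping only the leading DCT coefficients is what reduces the dimensionality; the transform itself provides the approximate decorrelation. How the resulting vectors are combined with the PLP features is not described here, so that step is left out of the sketch.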