IV.- TWO-LAYER SOURCE-FILTER TRANSFORMATIONS
VIDEO 4, VIDEO 5, VIDEO 6.
Many sound sources, including voiced speech, can be successfully
modeled as the convolution of a broad-band source excitation, such as
the pseudo-periodic glottal flow, with a time-varying resonant filter,
such as the vocal tract, that "colors" the excitation to produce
speech sounds or other distinctions.
When the excitation has a spectrum consisting
of well-defined harmonics, the overall spectrum is
in essence the resonant frequency response
sampled at the frequencies of the harmonics.
Figure 6 shows a spectrogram in which the harmonics and the formants are clearly visible.
Convolution of the source with the filter in the time domain
corresponds to multiplying their spectra in the Fourier domain,
or adding them in the log-spectral domain. Hence, we model
the log-spectrum X as the sum of variables
F and H, which explicitly model the formants and
the harmonics of the speech signal.
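The additive relation can be checked numerically. The following is a minimal sketch, not the model used in this work. It assumes a toy impulse-train excitation and a decaying exponential as the filter (both hypothetical choices), uses circular convolution so that the discrete Fourier relation holds exactly, and uses a naive DFT to stay within the Python standard library.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform, enough for a small demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def circular_convolve(a, b):
    """Circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[m] * b[(t - m) % n] for m in range(n)) for t in range(n)]

N, PERIOD = 32, 8  # hypothetical frame length and pitch period (samples)

# Toy pseudo-periodic excitation: one impulse per pitch period.
excitation = [1.0 if t % PERIOD == 0 else 0.0 for t in range(N)]
# Toy resonant filter: a decaying impulse response (assumption).
impulse_response = [0.6 ** t for t in range(N)]

signal = circular_convolve(excitation, impulse_response)

source_spec = dft(excitation)
filter_spec = dft(impulse_response)
signal_spec = dft(signal)

# The excitation only has energy at multiples of N / PERIOD.
harmonic_bins = [k for k in range(N) if abs(source_spec[k]) > 1e-9]

# At those bins, X = F + H in the log-magnitude domain.
max_err = 0.0
for k in harmonic_bins:
    log_h = math.log(abs(source_spec[k]))   # harmonics term H
    log_f = math.log(abs(filter_spec[k]))   # formants term F
    log_x = math.log(abs(signal_spec[k]))   # observed log-spectrum X
    max_err = max(max_err, abs(log_x - (log_f + log_h)))

print(harmonic_bins)
print(max_err)
```

Because this toy excitation has equal energy at every harmonic, the signal spectrum at the harmonic bins is the filter response scaled by a constant, which is the "resonant response sampled at the harmonics" picture described earlier.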
The source-filter transformation model is based on two
additive layers of the deformation model
described above, as illustrated in Figure 7.
Figure 7
Variables F and H in the model are hidden,
while, as before, X can be observed or hidden.
The symmetry between the two layers
is broken by using different parameters in each,
chosen to suit the particular dynamics
of each component.
We use transformations with a larger
support in the formant layer than in the
harmonics layer. Since all harmonics
tend to move in the same direction, we enforce smoother
transformation maps on the harmonics
layer by using potential transition matrices with higher
self-loop probabilities.
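As a rough illustration of the role of the self-loop probability, the sketch below builds a transition matrix over a small set of candidate transformations. The function names, the number of transformations, the uniform off-diagonal mass, and the values 0.95 and 0.70 are all assumptions made for this example; the parameters actually used in the model are not stated here.

```python
def transition_matrix(n_transforms, self_loop):
    """Row-stochastic matrix over candidate transformations.

    Stays on the same transformation with probability `self_loop` and
    spreads the remaining mass uniformly over the others (an assumption).
    """
    if n_transforms < 2 or not 0.0 < self_loop < 1.0:
        raise ValueError("need n_transforms >= 2 and 0 < self_loop < 1")
    off_diagonal = (1.0 - self_loop) / (n_transforms - 1)
    return [[self_loop if i == j else off_diagonal
             for j in range(n_transforms)]
            for i in range(n_transforms)]

def expected_changes(matrix, n_positions):
    """Expected number of transformation changes along a chain of
    `n_positions` adjacent positions, reading the matrix as Markov
    transition probabilities (a simplification of the potentials)."""
    change_prob = 1.0 - matrix[0][0]
    return change_prob * (n_positions - 1)

# Hypothetical settings: smoother maps on the harmonics layer.
harmonics_T = transition_matrix(5, 0.95)
formants_T = transition_matrix(5, 0.70)

print(expected_changes(harmonics_T, 21))
print(expected_changes(formants_T, 21))
```

Under these assumed values the harmonics layer changes transformation far less often along a chain than the formants layer, which is the sense in which a higher self-loop probability yields a smoother transformation map.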
Figure 8 shows the decomposition of
a speech signal into harmonics and formants
components, illustrated as the means
of the posteriors of the continuous hidden variables
in each layer.
Figure 8
The decomposition is not perfect. Since
we separate the components in terms of differences
in dynamics, this criterion becomes insufficient
when both layers have similar motion.
Separation improves modeling precisely
when each component has a different motion;
when the motions coincide, it is not really
important in which layer the source is actually captured.
However, some applications may require
a better separation, which is part of our current
research.
Once inference has been done in the model, "expected"
transformation maps can be found for both the harmonics and the
formants layers.
Figure 9 b) shows the spectrogram of
part a) with a "missing" region; notice that the two
layers have distinctly different motions.
In c) the region has been filled via inference
in a single-layer model; notice
that since the formant motion does not follow the harmonics,
the formants are not captured in the
reconstruction. In d) the two
layers are first decomposed
and then each layer is filled in; the
figure shows the sum of the filled-in versions of the two layers.
Figure 9
This video illustrates the need for a
model that takes into account the production model for
voiced speech. It shows a single-layer
"fill in" application, where the "missing"
harmonics are correctly regenerated
while some of the "missing"
formants are not. The panels display the same
information as in the previous two videos.
VIDEO 4.- One layer; Missing Data Application; Introducing the need for two layers.
This video performs harmonics/formants
separation. The screen has 5 panels. Panel 1 shows
the original spectrogram. Panel 2 displays the estimated
means of the harmonics posteriors. Panel 3 displays the
estimated means of the formants posteriors. Panel 4 displays
the transformation maps for the harmonics layer. Panel 5
displays the transformation maps for the formants layer.
VIDEO 5.- Harmonics/Formants Separation.
Description of Video 6.
This video performs the two-layer missing
data application. Here each layer is "filled in"
independently. As before, the transformation maps for each
layer are re-estimated using uniform
messages from the local likelihood potentials
on the "missing" regions. The complete spectrogram
is "filled in" with the sum of
the "filled in" versions of each layer.
The screen has 5 panels. Panel
1 shows the complete spectrogram with the "filled in" values.
Panel 2 displays the "filled in" values
for the harmonics layer. Panel
3 displays the "filled in"
values for the formants layer. Panel
4 displays the transformation maps for the harmonics layer.
Panel 5 displays the transformation
maps for the formants layer.
VIDEO 6.- Two layers Missing Data Application.
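The effect of a uniform local likelihood message can be sketched at a single node. This is a simplified illustration, not the inference procedure of the model: the helper names `belief` and `combine_layers`, the three candidate transformations, and all numeric values are hypothetical.

```python
def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def belief(local_likelihood, neighbor_message):
    """Belief over candidate transformations at one node: the normalized
    product of the local likelihood message and the neighbor message."""
    return normalize([l * m for l, m in zip(local_likelihood, neighbor_message)])

def combine_layers(harmonics_fill, formants_fill):
    """Complete log-spectrum as the sum of the two filled-in layers."""
    return [h + f for h, f in zip(harmonics_fill, formants_fill)]

# Message arriving from neighboring nodes (hypothetical values).
neighbor_message = [0.1, 0.7, 0.2]

# Observed region: the data favor the first transformation.
observed_belief = belief([0.8, 0.1, 0.1], neighbor_message)

# "Missing" region: the local likelihood message is uniform, so the
# belief is determined by the neighbors alone.
missing_belief = belief([1.0, 1.0, 1.0], neighbor_message)

# Each layer is filled in independently, then the layers are added.
filled = combine_layers([0.5, 1.5, 2.0], [0.5, 0.5, 1.0])

print(observed_belief)
print(missing_belief)
print(filled)
```

With a uniform message the belief in the "missing" region reduces to the normalized neighbor message, so the transformation there is propagated from the surrounding observed context within each layer.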