"Deformable Spectrograms"
Manuel Reyes-Gomez           Nebojsa Jojic           Daniel P.W. Ellis
Columbia University          Microsoft Research      Columbia University
 

IV.- TWO LAYER SOURCE-FILTER TRANSFORMATIONS
      VIDEO 4, VIDEO 5, VIDEO 6.

Many sound sources, including voiced speech, can be successfully regarded as the convolution of a
broad-band source excitation, such as the pseudo-periodic glottal flow,  and a time-varying
resonant  filter, such as the vocal tract, that `colors' the excitation to produce speech sounds or
other distinctions.

When the excitation has a spectrum consisting of well-defined harmonics, the overall spectrum is
in essence the resonant frequency response sampled at the frequencies of the harmonics.
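As a rough illustration of this sampling view, the sketch below evaluates a toy resonant envelope at the harmonics of a fundamental frequency. The envelope shape (two Gaussian bumps at 500 Hz and 1500 Hz), the 100 Hz fundamental, and the function names are assumptions made for the example; they are not taken from our model.

```python
import numpy as np

def toy_envelope(freq_hz, centers=(500.0, 1500.0), bandwidth=200.0):
    """Hypothetical resonant response: a sum of Gaussian bumps ("formants")."""
    freq_hz = np.asarray(freq_hz, dtype=float)
    return sum(np.exp(-0.5 * ((freq_hz - c) / bandwidth) ** 2) for c in centers)

def harmonic_sampled_envelope(f0, envelope, fmax):
    """Return the harmonic frequencies k*f0 (up to fmax) and the envelope there."""
    harmonics = np.arange(1, int(fmax // f0) + 1) * f0
    return harmonics, envelope(harmonics)

# Assumed values: 100 Hz fundamental, spectrum evaluated up to 4 kHz.
harmonics, amplitudes = harmonic_sampled_envelope(100.0, toy_envelope, 4000.0)

# The strongest harmonics are those that fall on the envelope peaks.
strongest = harmonics[np.argmax(amplitudes)]
```

Changing the fundamental moves the sampling points along the same envelope, which is why the harmonics and the formants can move independently in a spectrogram.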

Figure 6 shows a spectrogram in which the harmonics and the formants are clearly visible.

Convolution of the source with the filter in the time domain corresponds to multiplying
their spectra in the Fourier domain, or adding them in the log-spectral domain. Hence, we model
the log-spectra X as the sum of variables F and H, which explicitly model the formants and
the harmonics of the speech signal. The source-filter transformation model is based on two
additive layers of the deformation model described above, as illustrated in Figure 7.
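The additive relation X = F + H can be checked numerically. The following is a minimal sketch, assuming a toy impulse-train excitation and a toy decaying resonance (both invented for the example), and using a single FFT over the whole signal rather than the short-time frames of a spectrogram.

```python
import numpy as np

def log_magnitude_spectrum(signal, n_fft):
    """Log of the magnitude of the real FFT, zero-padded to n_fft samples."""
    return np.log(np.abs(np.fft.rfft(signal, n=n_fft)))

# Toy excitation: an impulse every 20 samples (pseudo-periodic source).
n = np.arange(200)
excitation = (n % 20 == 0).astype(float)

# Toy filter: a short decaying resonance (stand-in for the vocal tract).
m = np.arange(32)
resonance = np.exp(-m / 8.0) * np.cos(2 * np.pi * 0.1 * m)

# Time-domain convolution of source and filter.
speech = np.convolve(excitation, resonance)

# Zero-padding both factors to the length of the convolution makes the
# product of spectra match the spectrum of the convolved signal.
n_fft = len(speech)
H = log_magnitude_spectrum(excitation, n_fft)  # harmonics layer
F = log_magnitude_spectrum(resonance, n_fft)   # formants layer
X = log_magnitude_spectrum(speech, n_fft)      # observed log-spectrum

max_abs_error = np.max(np.abs(X - (F + H)))
```

Here `max_abs_error` is at the level of floating-point round-off, so the log-spectrum of the convolved signal is the sum of the two layers. In the model itself F and H are hidden variables to be inferred, not computed from known signals as in this sketch.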


                                                                        Figure 7

Variables F and H in the model are hidden, while, as before, X can be observed or hidden.
The symmetry between the two layers is broken by using different parameters in each,
chosen to suit the particular dynamics of each component.
We use transformations with a larger support in the formant layer compared to the
harmonics layer. Since all harmonics tend to move in the same direction, we enforce smoother
transformation maps on the harmonics layer by using potential transition matrices with higher
self-loop probabilities.
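To make the role of the self-loop probability concrete, here is a minimal sketch of a transition matrix over a discrete set of transformations. The uniform off-diagonal structure, the number of transformations, and the values 0.9 and 0.6 are assumptions chosen for illustration; they are not the parameters used in our experiments.

```python
import numpy as np

def transition_matrix(n_transforms, self_loop):
    """Row-stochastic matrix: stay on the same transformation with probability
    self_loop, otherwise move to any other transformation uniformly."""
    if n_transforms < 2:
        raise ValueError("need at least two transformations")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    off_diagonal = (1.0 - self_loop) / (n_transforms - 1)
    matrix = np.full((n_transforms, n_transforms), off_diagonal)
    np.fill_diagonal(matrix, self_loop)
    return matrix

# Assumed values: a high self-loop probability favors neighboring patches
# sharing the same transformation, giving smoother transformation maps.
harmonics_T = transition_matrix(5, self_loop=0.9)
formants_T = transition_matrix(5, self_loop=0.6)
```

With the higher self-loop value, a change of transformation between neighboring patches is penalized more heavily, which matches the observation that all harmonics tend to move in the same direction.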

Figure 8 shows the decomposition of a speech signal into harmonics and formants
components, illustrated as the means of the posteriors of the continuous hidden variables
in each layer.


                                                              Figure 8

The decomposition is not perfect. Since we separate the components in terms of differences
in dynamics, this criterion becomes insufficient when both layers have similar motion.
Separation improves modeling precisely when each component has a different motion; when the
motions coincide, it is not really important in which layer the source is actually captured.
However, some applications may require a better separation, which is part of our current
research.

Once inference has been done in the model, "expected" transformation maps can be found for
both the harmonics and the formants layers.

Figure 9 b) shows the spectrogram of part a) with a "missing" region; notice that the two
layers have distinctly different motions.  In c) the region has been filled via inference
in a single-layer model;  notice that since the formant motion does not follow the harmonics,
the formants are not captured in the reconstruction. In d) the two layers are first decomposed
and then each layer is filled in; the figure shows the addition of the filled-in version in each layer.


                                                                                           Figure 9

Description of Video 4.

This video illustrates the need for a model that takes into account the production model for
voiced speech. It shows a single-layer "fill in" application in which the "missing"
harmonics are correctly regenerated, while some of the "missing" formants are not.
The panels display the same information as in the previous two videos.

VIDEO 4.- One layer; Missing Data Application;
Introducing the need for two layers.


Description of Video 5.

This video performs harmonics/formants separation. The screen has 5 panels. Panel 1 shows
the original spectrogram. Panel 2 displays the estimated means for the harmonics posteriors.
Panel 3 displays the estimated means for the formants posteriors. Panel 4 displays the
transformation maps for the harmonics layer. Panel 5 displays the transformation maps for the
formants layer.

VIDEO 5.- Harmonics/Formants Separation.



Description of Video 6.

This video performs the two-layer missing data application. Here each layer is "filled in"
independently. As before, the transformation maps for each layer are re-estimated using uniform
messages from the local likelihood potentials on the "missing" regions. The complete spectrogram
is "filled in" with the sum of the "filled in" versions of each layer.
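The final combination step can be sketched as follows. This is only an illustration of summing the two layers inside the "missing" region: the per-layer estimates are passed in as ready-made arrays (in the model they come from inference), and the function name, mask convention, and toy values are assumptions.

```python
import numpy as np

def fill_in(observed, missing_mask, harmonics_hat, formants_hat):
    """Keep observed log-spectral values and, where the mask is True, use the
    sum of the filled-in harmonics and formants layers."""
    reconstruction = harmonics_hat + formants_hat
    return np.where(missing_mask, reconstruction, observed)

# Toy 2x3 log-spectrogram whose last column is "missing".
observed = np.array([[1.0, 2.0, 0.0],
                     [3.0, 4.0, 0.0]])
missing_mask = np.array([[False, False, True],
                         [False, False, True]])
harmonics_hat = np.full((2, 3), 0.5)  # hypothetical per-layer estimates
formants_hat = np.full((2, 3), 1.5)

filled = fill_in(observed, missing_mask, harmonics_hat, formants_hat)
```

Observed entries pass through unchanged, and each missing entry receives the sum of the two layer estimates.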

The screen has 5 panels. Panel 1 shows the complete spectrogram with the "filled in" values.
Panel 2 displays the "filled in" values for the harmonics layer. Panel 3 displays the "filled in"
values for the formants layer. Panel 4 displays the transformation maps for the harmonics layer.
Panel 5 displays the transformation maps for the formants layer.

VIDEO 6.- Two layers Missing Data Application.



 
