"Deformable Spectrograms"
Manuel Reyes-Gomez           Nebojsa Jojic           Daniel P.W. Ellis
Columbia University          Microsoft Research      Columbia University

V.-  MATCHING-TRACKING MODEL
      VIDEO 7, VIDEO 10.

Predicting a frame from its context is not always possible, for example at transitions
between silence and speech or between voiced and unvoiced speech, so we need a
set of states to represent these unpredictable frames explicitly.
We also need a second ``switch'' variable that decides when to ``track'' (transform) and
when to ``match'' the observation with a state. Figure 10 shows a graphical representation of this
model. At each time frame, discrete variables St and Ct are connected to all frequency bins in that
frame. St selects a component of a uniformly weighted GMM containing the means and the variances of
the explicit states. Variable Ct takes two values: when it is equal to 0, the model is in ``tracking mode''; a value
of 1 designates ``matching mode''.


                                                                               Figure 10

The potentials between variables St, Ct, Xt, Ft and Ht are described in the paper.
The posteriors for variables St and Ct, Q(St) and Q(Ct), are obtained using the
belief propagation rules. Q(Ct=0) is large if the match between the current frame
and its prediction from the context is better than the match between the current
frame and the means set of the GMM. In early iterations, when the means are still
quite random, the match between the means set and the observations is poor,
making Q(Ct=0) large, with the result that the explicit states are never used.
To prevent this we start the model with large variances, which results in
non-zero values for Q(Ct=1), and hence the explicit states are learned.
As we progress, we start to learn the variances. While the variances are large, most
of the frames are used in learning the means set, typically resulting in a set of similar
"blurry" states. As the variances are learned and become smaller,
Q(Ct=1) takes non-zero values in only a few frames, so only those frames are used to
learn the means set, resulting in "sharper" means.
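
The effect of the variances on the switch posterior can be illustrated with a minimal sketch. It is not the belief propagation update from the paper: it assumes isotropic Gaussian likelihoods for both modes, an equal prior on Ct, and that the variances in question are those of the explicit states. The names gaussian_loglik, logsumexp and match_posterior are hypothetical.

```python
import math

def gaussian_loglik(frame, mean, var):
    """Log-likelihood of a frame under an isotropic Gaussian (assumed form)."""
    return sum(-0.5 * math.log(2 * math.pi * var) - 0.5 * (x - m) ** 2 / var
               for x, m in zip(frame, mean))

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def match_posterior(frame, prediction, track_var, means, match_var, prior_match=0.5):
    """Return Q(Ct=1), the posterior that the frame is matched rather than tracked."""
    log_track = math.log(1.0 - prior_match) + gaussian_loglik(frame, prediction, track_var)
    # Uniformly weighted GMM over the explicit states.
    state_logliks = [gaussian_loglik(frame, m, match_var) for m in means]
    log_match = math.log(prior_match) + logsumexp(state_logliks) - math.log(len(means))
    diff = log_match - log_track
    # Stable logistic function of the log-odds.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# A frame that is well predicted from its context, with "random" initial means.
frame = [1.0, 2.0, 3.0, 2.0]
prediction = [1.1, 1.9, 3.1, 2.1]
random_means = [[0.0, 0.5, 0.2, 0.9], [3.0, 0.1, 1.0, 0.0]]

q_narrow = match_posterior(frame, prediction, 0.1, random_means, match_var=0.1)
q_wide = match_posterior(frame, prediction, 0.1, random_means, match_var=10.0)

# A frame at a discontinuity: the context predicts silence, but a state fits well.
learned_means = [[1.0, 2.0, 3.0, 2.1], [3.0, 0.1, 1.0, 0.0]]
q_onset = match_posterior(frame, [0.0, 0.0, 0.0, 0.0], 0.1, learned_means, match_var=0.1)

print(q_narrow, q_wide, q_onset)
```

With small state variances and random means, Q(Ct=1) is vanishingly small for the well-predicted frame; widening the variances keeps it many orders of magnitude larger, which is what lets the explicit states receive some training weight early on. For the frame at a discontinuity, Q(Ct=1) is close to one.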

We start with a relatively large number of means, but the number in use becomes much smaller once
the variances are learned. The resulting states typically consist of single frames at
discontinuities, as intended. Figure 11 a) shows the frames chosen for the spectrogram
in Figure 1. The signal is reconstructed using the corresponding chosen frames
on the two layers. The reconstruction is simply another instance of inferring missing
values, except that the motion fields are not re-estimated, since we have the true ones.
Figure 11 shows several stages of the reconstruction.

                                                                                                              Figure 11

Videos 7 and 10 provide good insight into how this model operates.

Description of Video 7.
In this video, the matching-tracking model is applied to a single-source signal. The video is
divided into two parts: the estimation of the model and the reconstruction of the signal spectrogram
from the model parameters. As mentioned above, some care must be taken with the model variances
to ensure adequate estimation of the model parameters. We keep the variances large during
the early stages of the model estimation and learn them in the later stages. We present and
briefly discuss three screenshots from this video.

During model estimation, the demo displays information in 4 panels. Panel 1 displays the signal to
be modeled. Panel 4 displays the means set for the states of variable St. Panel 2 displays the mean
of the most likely state for each frame. Panel 3 shows the values of the posterior Q(Ct=1),
i.e. the probability that the model matches frame t with a state from variable St rather than tracking
(estimating) the frame from its context.
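
As a small illustration of what panel 2 displays, the sketch below picks, for each frame, the mean of the state with the highest posterior Q(St). The function name and the toy values are hypothetical and are not taken from the demo.

```python
def most_likely_state_means(q_state, means):
    """For each frame, return the mean of the state with the highest posterior."""
    picked = []
    for posterior in q_state:
        best = max(range(len(posterior)), key=posterior.__getitem__)
        picked.append(means[best])
    return picked

# Two states with two frequency bins each, and posteriors Q(St) for three frames.
toy_means = [[0.0, 1.0], [2.0, 3.0]]
toy_q_state = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]

panel2 = most_likely_state_means(toy_q_state, toy_means)
print(panel2)
```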

In panel 3, frames in dark blue are those where Q(Ct=1) is close to zero, meaning
that the frame is tracked rather than matched to a state. Frames in dark red are those
where Q(Ct=1) is close to one, meaning that the frame is used to learn the states during the model
estimation. Frames with colors in between are partially tracked and partially matched to a state, and they
are also partially used to learn the states. Notice that this is the case during
the early stages of the model estimation procedure (screen 1). This results in a set of "blurry" means (panel 4).
Once we start to learn the variances in the later stages of the model estimation procedure (screen 2),
the posteriors Q(Ct=1) become very "peaky", meaning that frames are either "tracked" or "matched".
Notice that very few frames are matched and represented in the means set. In fact, the final means are
composed of single frames, which constitute a very "sharp" set of means (panel 4). Also notice that
even though we started with a set of 15 means, only 9 are actually used. When the model estimation
finishes (screen 2), panel 5 shows the chosen frames.
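
The difference between "blurry" and "sharp" means can be sketched with a soft re-estimation step in which each frame contributes to state k with weight Q(Ct=1) times Q(St=k). This is an assumed, EM-style weighted average used for illustration only; the exact update used in the paper may differ, and update_means and the toy data are hypothetical.

```python
def update_means(frames, q_match, q_state, eps=1e-12):
    """Re-estimate state means as posterior-weighted averages of the frames.

    frames  : list of T frames, each a list of frequency-bin values
    q_match : list of T values Q(Ct=1)
    q_state : T x K nested list of Q(St=k)
    Returns (means, usage); means[k] is None when state k receives no weight.
    """
    n_states = len(q_state[0])
    n_bins = len(frames[0])
    means, usage = [], []
    for k in range(n_states):
        weights = [q_match[t] * q_state[t][k] for t in range(len(frames))]
        total = sum(weights)
        usage.append(total)
        if total < eps:
            means.append(None)  # unused state
            continue
        means.append([sum(w * f[b] for w, f in zip(weights, frames)) / total
                      for b in range(n_bins)])
    return means, usage

frames = [[1.0, 0.0], [0.0, 4.0], [2.0, 2.0], [5.0, 1.0]]

# Early stage: every frame is partially matched to every state.
soft_means, soft_usage = update_means(frames, [0.5] * 4, [[1 / 3] * 3] * 4)

# Late stage: "peaky" posteriors, only two frames are matched.
sharp_means, sharp_usage = update_means(
    frames,
    [0.0, 1.0, 0.0, 1.0],
    [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]],
)
n_used = sum(1 for u in sharp_usage if u > 0)
print(soft_means, sharp_means, n_used)
```

In the early-stage case all three means collapse to the same average frame, while in the late-stage case each used mean is a single frame and the third state is left unused, mirroring the reduction from 15 initial means to 9 used ones described above.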

The signal is reconstructed (screen 3) using the corresponding chosen frames on the two layers.
The reconstruction is simply another instance of inferring missing values, except that the motion fields are not
re-estimated, since we have the true ones. Panel 2 shows the reconstruction of the harmonics layer, panel
3 shows the reconstruction of the formants layer, and panel 1 shows the complete reconstruction obtained by
adding the reconstructions of the two layers.
 

VIDEO 7.- Matching-Tracking Model of a single source signal and
signal reconstruction from the model.
