V.- MATCHING-TRACKING MODEL
VIDEO 7, VIDEO 10.
Predicting a frame from its context is not always possible, for example at
transitions between silence and speech or between voiced and unvoiced speech,
so we need a set of states to represent these unpredictable frames explicitly.
We also need a second ``switch'' variable that decides when to ``track''
(transform) and when to ``match'' the observation with a state. Figure 10
shows a graphical representation of this model. At each time frame, the
discrete variables St and Ct are connected to all frequency bins in that
frame. St selects a component of a uniformly-weighted GMM that holds the
means and the variances of the states to model. Variable Ct takes two values:
a value of 0 means the model is in ``tracking mode''; a value of 1 designates
``matching mode''.
Figure 10
The potentials between variables St, Ct, Xt, Ft and Ht are described in the
paper.
The posteriors for variables St and Ct, Q(St) and Q(Ct), are obtained using
the belief propagation rules. Q(Ct = 0) is large if the match between the
current frame and its prediction from the context is better than the match
between the current frame and the means set of the GMM.
In early iterations, when the means are still quite random, the match between
the means set and the observations is poor, which makes Q(Ct = 0) large, with
the result that the explicit states are never used. To prevent this, we start
the model with large variances, which results in non-zero values for
Q(Ct = 1), and hence the explicit states are learned.
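The effect of the initial variances on the switch posterior can be illustrated with a minimal sketch. It is not the paper's belief propagation update: it assumes isotropic Gaussian likelihoods, a single variance `var` shared by the tracking and matching modes, and a fixed prior `prior_match` on matching mode. The function name `switch_posterior` and all parameter names are hypothetical.

```python
import numpy as np

def switch_posterior(frame, prediction, means, var, prior_match=0.5):
    """Return Q(C=1), the posterior of matching mode, for a single frame.

    frame, prediction: arrays of shape (n_bins,)
    means:             array of shape (n_states, n_bins)
    var:               scalar variance shared by both modes (assumption)
    """
    n_bins = frame.shape[0]

    def log_gauss(x, mu):
        # Log density of an isotropic Gaussian, summed over frequency bins.
        sq_dist = np.sum((x - mu) ** 2, axis=-1)
        return -0.5 * (n_bins * np.log(2 * np.pi * var) + sq_dist / var)

    # Tracking mode (C=0): the frame is explained by its prediction from context.
    log_track = np.log(1 - prior_match) + log_gauss(frame, prediction)
    # Matching mode (C=1): the frame is explained by a uniformly weighted GMM.
    log_states = log_gauss(frame, means)  # shape (n_states,)
    log_match = (np.log(prior_match)
                 + np.logaddexp.reduce(log_states) - np.log(len(means)))
    # Normalise in the log domain for numerical stability.
    return float(np.exp(log_match - np.logaddexp(log_track, log_match)))

# Toy data: an accurate prediction from context and two poorly placed means.
frame = np.linspace(0.0, 1.0, 32)
prediction = frame + 0.01
means = np.stack([frame + 3.0, frame - 3.0])

q_small = switch_posterior(frame, prediction, means, var=1.0)
q_large = switch_posterior(frame, prediction, means, var=1000.0)
print(q_small, q_large)
```

With the small variance the poorly placed means are ruled out almost completely, so matching mode is never chosen. With the large variance the two modes become nearly indistinguishable and Q(C=1) stays close to the prior, which lets the frames contribute to learning the states.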
As we progress, we start to learn the variances. While the variances are
large, most of the frames are used in learning the means set, typically
resulting in a set of similar "blurry" states. However, as the variances are
learned and become smaller, Q(Ct=1) takes non-zero values in only very few
frames. Only those few frames are then used to learn the means set, resulting
in "sharper" means.
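The difference between "blurry" and "sharp" means can be sketched with a responsibility-weighted average. This assumes the means are updated with the standard weighted-mean form used for Gaussian mixtures, with each frame weighted by Q(Ct=1) Q(St=k); the actual update is defined by the potentials in the paper. The function name `update_means` and its arguments are hypothetical.

```python
import numpy as np

def update_means(frames, q_match, q_state):
    """Responsibility-weighted update of the state means.

    frames:  (n_frames, n_bins) observed spectrogram frames
    q_match: (n_frames,) posterior Q(C_t = 1) for each frame
    q_state: (n_frames, n_states) posterior Q(S_t = k) for each frame
    """
    # Each frame contributes to state k in proportion to Q(C_t=1) * Q(S_t=k).
    weights = q_match[:, None] * q_state          # (n_frames, n_states)
    totals = weights.sum(axis=0)                  # (n_states,)
    # Guard against states that received no weight at all.
    return (weights.T @ frames) / np.maximum(totals, 1e-12)[:, None]

# Three toy frames and a single state.
frames = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
q_state = np.ones((3, 1))

# Large variances: every frame is partially matched, so the mean is an average.
blurry = update_means(frames, np.array([0.5, 0.5, 0.5]), q_state)
# Small variances: only one frame is matched, so the mean is that frame.
sharp = update_means(frames, np.array([0.0, 0.0, 1.0]), q_state)
print(blurry, sharp)
```

In the first call the mean is an average over all frames; in the second it coincides with the single matched frame, which mirrors the single-frame states described in the text.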
We start with a relatively large number of means, but the number in use
becomes much smaller once the variances are learned. The resulting states
typically consist of single frames at discontinuities, as intended. Figure
11 a) shows the frames chosen for the spectrogram in Figure 1. The signal
reconstruction is done by using the corresponding chosen frames on the two
layers. The reconstruction is simply another instance of inferring missing
values, except that the motion fields are not re-estimated since we have the
true ones. Figure 11 shows several stages of the reconstruction.
Figure 11
Videos 7 and 10 provide good insight into how this model operates.
Description of Video 7.
In this video, the matching-tracking model is applied to a single source
signal. The video is divided into two parts: the estimation of the model and
the reconstruction of the signal spectrogram from the model parameters. As
mentioned above, some care has to be taken with the model variances to ensure
adequate estimation of the model parameters. We keep the variances large
during the early stages of the model estimation and learn them in the later
stages. We present and briefly discuss three screen shots for this video.
During model estimation, the demo displays information in 4 panels. Panel 1
displays the signal to be modeled. Panel 4 displays the means set for the
states of variable St. Panel 2 displays the mean of the most likely state for
each frame. Panel 3 shows the values of the posteriors for Ct = 1, i.e. the
probability that the model matches frame t with a state from variable St
rather than tracking (estimating) the frame from its context.
In panel 3, frames in dark blue are those where Q(Ct=1) is close to zero,
meaning that the frame is tracked rather than matched to a state. Frames in
dark red are those where Q(Ct=1) is close to one, meaning that the frame is
used to learn the states during the model estimation. Frames with colors in
between are partially tracked and partially matched to a state, and they are
also partially used to learn the states. Notice that this is the case during
the early stages of the model estimation procedure (screen 1). This results
in a set of "blurry" means (panel 4).
Once we start to learn the variances in the later stages of the model
estimation procedure (screen 2), the posteriors Q(Ct=1) become very "peaky,"
meaning that each frame is either "tracked" or "matched". Notice that very
few frames are matched and represented in the means set. In fact, the final
means are composed of single frames, which constitute a very "sharp" set of
means (panel 4). Also notice that even though we started with a set of 15
means, only 9 are actually used. When the model estimation finishes (screen
2), panel 5 shows the chosen frames.
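One way to count how many of the initial means end up in use is to look at which states are the most likely one for at least one confidently matched frame. This is only an illustrative sketch: the threshold of 0.5 on Q(Ct=1), the function name `used_states`, and the toy posteriors are assumptions, not part of the demo.

```python
import numpy as np

def used_states(q_match, q_state, threshold=0.5):
    """Indices of states that claim at least one confidently matched frame.

    q_match: (n_frames,) posterior Q(C_t = 1) for each frame
    q_state: (n_frames, n_states) posterior Q(S_t = k) for each frame
    """
    matched = q_match > threshold              # frames in matching mode
    if not matched.any():
        return np.array([], dtype=int)
    winners = q_state[matched].argmax(axis=1)  # most likely state per matched frame
    return np.unique(winners)

# Toy "peaky" posteriors: 6 frames, 4 candidate states.
q_match = np.array([0.01, 0.99, 0.02, 0.97, 0.00, 0.98])
q_state = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [1.00, 0.00, 0.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.00, 1.00, 0.00],
    [0.25, 0.25, 0.25, 0.25],
    [1.00, 0.00, 0.00, 0.00],
])

used = used_states(q_match, q_state)
print(f"{len(used)} of {q_state.shape[1]} states are used: {used.tolist()}")
```

States that never win a matched frame keep whatever means they had and play no role in the reconstruction, which is consistent with fewer means being used than were initialised.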
The signal reconstruction is done (screen 3) by using the corresponding
chosen frames on the two layers. The reconstruction is simply another
instance of inferring missing values, except that the motion fields are not
re-estimated since we have the true ones. Panel 2 shows the reconstruction of
the harmonics layer, panel 3 shows the reconstruction of the formants layer,
and panel 1 shows the complete reconstruction obtained by adding the
reconstructions of the two layers.
VIDEO 7.- Matching-Tracking Model of a single source signal and signal
reconstruction from the model.