Manuel Reyes-Gomez (Columbia University)
Nebojsa Jojic (Microsoft Research)
Daniel P.W. Ellis (Columbia University)
! WATCH THE TEN VIDEOS BELOW !
I.- INTRODUCTION
In many audio signals, including speech and musical instruments,
there is high correlation
between adjacent frames of their spectral representation.
Our approach exploits
this correlation so that explicit models are required only
for those frames that cannot be accurately
predicted from their context.
Our model captures the general properties of such audio
sources by modeling the evolution
of their harmonic components. Using the common source-filter
model for such signals, we
devise a layered generative model that describes these
two components in separate layers:
one for the excitation harmonics, and another for resonances
such as vocal tract formants.
Our approach explicitly models the self-similarity and
dynamics of each layer by fitting the
log-spectra in frame t with a set of transformations
of the log-spectra in frame t-1. As a result,
we do not require separate states for every possible
spectral configuration, but only a limited
set of "sharp" states that can still cover the full spectral
variety of a source via such
transformations. This approach is thus suitable for any
time series data with high correlation
between adjacent observations.
We will first introduce a model that captures the spectral
deformation field of the speech
harmonics, and show how this can be exploited to interpolate
missing observations. Then, we
introduce the two-layer model that separately models
the deformation fields for harmonic
and formant resonance components, and show, through examples of
the missing data scenario with
one and two layers, that such a separation is necessary to
accurately describe speech signals.
Then we will present the complete model including the
two deformation fields and the
"sharp" states. This model, with only a few states
and both deformation fields, can
accurately reconstruct the signal.
Finally, we briefly describe a range of existing applications,
including semi-supervised source
separation, and discuss the model's possible application
to unsupervised source separation.
II.- SPECTRAL DEFORMATION MODEL
Figure 1 shows a narrow-band spectrogram representation
of a speech signal, where each
column depicts the energy content across frequency in
a short-time window, or time-frame.
The value in each cell is actually the log-magnitude
of the short-time Fourier transform.
Figure 1
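As a rough illustration of the representation described above, the following sketch computes a log-magnitude short-time Fourier transform with NumPy. The function name `log_spectrogram`, the window length, the hop size and the synthetic test tone are illustrative assumptions, not the settings used for Figure 1.

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=128, eps=1e-10):
    """Return a (frequency bins x time frames) log-magnitude STFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # One windowed frame per column
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    # FFT of each column; rfft keeps only the non-negative frequencies
    magnitude = np.abs(np.fft.rfft(frames, axis=0))
    return np.log(magnitude + eps)

# Hypothetical test signal: a 200 Hz tone with a few harmonics, sampled at 8 kHz
fs = 8000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 200 * h * t) / h for h in range(1, 6))
X = log_spectrogram(x)
```

A long window (here 64 ms at 8 kHz) is what makes the spectrogram "narrow band": individual harmonics are resolved along the frequency axis.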
Using the subscript C to designate current and P to indicate
previous, the model predicts
a patch of Nc time-frequency bins centered at the kth
frequency bin of frame t as a
"transformation" of a patch of Np bins around
the kth bin of frame t-1.
Figure 1 shows an example with Nc = 3 and Np = 5 to illustrate
the intuition behind this
approach. The selected patch in frame t can be seen as
a close replica of an upward shift
of part of the patch highlighted in frame t-1.
This "upward" relationship can be captured by a
transformation matrix such as the one shown in the figure.
The patch in frame t-1 is larger than the patch in frame
t to permit both upward and
downward motions.
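The patch relationship above can be sketched as a small 0/1 matrix. This is a minimal illustration assuming pure integer shifts; `shift_matrix` and the toy patch values are hypothetical, and the transformation set used in the paper may be richer.

```python
import numpy as np

def shift_matrix(shift, n_c=3, n_p=5):
    """Nc x Np matrix copying Nc bins of the previous patch, displaced by `shift`.

    Rows and columns are indexed by increasing frequency. A positive shift means
    the pattern moved up in frequency between frame t-1 and frame t, so the
    current patch is read from lower bins of the previous patch.
    """
    max_shift = (n_p - n_c) // 2
    if abs(shift) > max_shift:
        raise ValueError("shift does not fit inside the previous patch")
    T = np.zeros((n_c, n_p))
    for i in range(n_c):
        T[i, i + max_shift - shift] = 1.0
    return T

prev_patch = np.array([0.0, 1.0, 5.0, 1.0, 0.0])  # peak at the centre bin k
up = shift_matrix(1) @ prev_patch      # peak predicted at bin k+1
same = shift_matrix(0) @ prev_patch    # identity transform
down = shift_matrix(-1) @ prev_patch   # peak predicted at bin k-1
```

With Nc = 3 and Np = 5 the previous patch leaves exactly one bin of slack on each side, which is why both an upward and a downward shift of one bin can be represented.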
The generative graphical model for a single layer is depicted in figure 2.
Figure 2: a) Graphical model; b) Graphical simplification
X nodes correspond to the observations, and T nodes to
the transformations at each frequency
bin. At each bin, the local likelihood potentials involve
the Nc bins used in the current frame,
the Np bins used in the previous frame and the set of
all possible transformation matrices defined
by T. Please read the paper for complete details.
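To make the role of the local likelihood potentials concrete, here is a hedged sketch that scores every candidate shift at every bin. It assumes an isotropic Gaussian error between the current patch and the shifted previous patch; the exact form of the potentials is given in the paper, and the names `local_log_likelihoods`, `shifts` and `var` are hypothetical.

```python
import numpy as np

def local_log_likelihoods(prev_frame, cur_frame, shifts=(-1, 0, 1), n_c=3, var=1.0):
    """Return a (bins x shifts) array of Gaussian log-likelihoods, up to a constant."""
    half_c = n_c // 2
    pad = half_c + max(abs(s) for s in shifts)
    prev = np.pad(prev_frame, pad, mode="edge")
    cur = np.pad(cur_frame, pad, mode="edge")
    n_bins = len(cur_frame)
    scores = np.empty((n_bins, len(shifts)))
    for k in range(n_bins):
        c = k + pad  # position of bin k in the padded frames
        cur_patch = cur[c - half_c : c + half_c + 1]
        for j, s in enumerate(shifts):
            # An upward shift s reads the previous frame s bins lower
            pred = prev[c - s - half_c : c - s + half_c + 1]
            scores[k, j] = -0.5 * np.sum((cur_patch - pred) ** 2) / var
    return scores

prev_frame = np.array([0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0])
cur_frame = np.array([0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0])  # peak moved up one bin
scores = local_log_likelihoods(prev_frame, cur_frame)
best_shift = np.array([-1, 0, 1])[np.argmax(scores, axis=1)]
```

In flat regions every shift scores equally, so the per-bin argmax is arbitrary there; only the subsequent inference over neighboring nodes can resolve such ties.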
Inference is efficiently performed via loopy belief propagation.
Once the posteriors of the
transformation nodes are estimated, we can compute
the "expected" transformation maps, which give an
appealing description of the harmonics' dynamics, as
can be observed in figure 3.
In these panels, the links between three specific time-frequency
bins and their corresponding
transformations on the map are highlighted. Bin 1 is
described by a steep downward
transformation, while bin 3 also has a downward motion
but is described by a less steep
transformation, consistent with the dynamics visible
in the spectrogram. Bin 2, on the other hand,
is described by a steep upward transformation.
Figure 3.- Transformation Map. Green: Identity transform;
Yellow/Orange: Upward motion, darker is steeper.
Blue: Downward motion, darker is steeper.
DEMO INTRODUCTION
We have built a real-time demo that performs a variety of applications using this model.
The user can change the different parameters of the model
in the user interface (Figure 4).
There are several panels and function buttons that we
will explain using different applications.
The information displayed on each panel changes with
each application.
We will present ten short videos of the demo running
these applications. Before each video we
will describe the application, the information displayed
in each panel and the functionality of
the buttons.
Description of Video 1.
We first present an instance of the demo performing basic
estimation of the harmonics
transformation maps followed by a harmonics tracking
application.
Figure 4 shows a typical screen shot of the demo for
this application. The figure displays
three panels. Panel 1 displays the signal to be processed.
Panel 2 shows the most likely
transformation obtained from the local likelihood potentials.
Here, as in the transformation maps,
the color relates to the motion present in the signal;
however, the structure is not as clearly
defined as in the transformation maps. Also notice
the total lack of a clear structure in the
silent regions of the signal. Panel 3 shows the transformation
maps obtained after each
complete iteration.
Each complete iteration consists of complete belief propagation
message passes through all
the vertical chains, followed
by complete belief propagation passes on all the
horizontal chains. Each vertical chain consists of all
the coefficients for a given frame, and each horizontal chain
consists of all the frames for a given coefficient. The
belief propagation rules for these chains can be
implemented using efficient forward/backward and upward/downward
recursions; see the extended
paper for details. The strength of the belief propagation
in each direction is controlled by the transition
potentials in that direction. Parameters "Ver. Factor"
and "Hor. Factor" affect the probability of
switching to a different transform: a higher value of
a factor results in "smoother" transformation
patterns in that direction. The video also shows the
effect of changes in those
factors.
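The recursion along a single chain can be sketched with a standard sum-product forward/backward pass. This is a simplified illustration for one chain only (the demo alternates between vertical and horizontal chains), and `stay_factor` is a hypothetical stand-in for the role played by "Ver. Factor" and "Hor. Factor"; how those parameters enter the actual transition potentials is an assumption here.

```python
import numpy as np

def chain_posteriors(local_lik, stay_factor):
    """Forward/backward posteriors on one chain of transformation nodes.

    local_lik: (nodes x transformations) non-negative local likelihoods.
    stay_factor: weight of keeping the same transformation relative to switching.
    """
    n_nodes, n_trans = local_lik.shape
    trans = np.ones((n_trans, n_trans)) + (stay_factor - 1.0) * np.eye(n_trans)
    trans /= trans.sum(axis=1, keepdims=True)

    fwd = np.empty_like(local_lik, dtype=float)
    bwd = np.ones_like(local_lik, dtype=float)
    fwd[0] = local_lik[0] / local_lik[0].sum()
    for i in range(1, n_nodes):
        fwd[i] = local_lik[i] * (fwd[i - 1] @ trans)
        fwd[i] /= fwd[i].sum()  # rescale to avoid underflow
    for i in range(n_nodes - 2, -1, -1):
        bwd[i] = trans @ (local_lik[i + 1] * bwd[i + 1])
        bwd[i] /= bwd[i].sum()
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

# The middle node has no local evidence; its neighbours favour the third transform
lik = np.array([[0.1, 0.1, 0.8],
                [0.1, 0.1, 0.8],
                [1 / 3, 1 / 3, 1 / 3],
                [0.1, 0.1, 0.8],
                [0.1, 0.1, 0.8]])
weak = chain_posteriors(lik, stay_factor=1.0)     # neighbours have no influence
strong = chain_posteriors(lik, stay_factor=10.0)  # neighbours dominate
```

With `stay_factor=1.0` the uninformed node stays uniform, while a larger value pulls it toward the transformation preferred by its neighbors, which is the smoothing effect described above.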
Once the transformation maps are estimated, some interesting
applications can be performed,
like tracking harmonics. The user "clicks" in a certain
region of the spectrogram, and if the
"Track H" button is pushed, the demo shows the history
of that particular time-frequency bin.
VIDEO 1.- Harmonics transformation maps and harmonics tracking application.
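The tracking of a bin's history can be sketched as follows, assuming the transformation map has been reduced to one integer shift per time-frequency bin (for example the most probable shift). The function `track_history` and the toy map are hypothetical.

```python
import numpy as np

def track_history(shift_map, k, t):
    """Trace bin k of frame t back to frame 0; return a list of (frame, bin) pairs."""
    n_bins = shift_map.shape[0]
    path = [(t, k)]
    while t > 0:
        # A positive (upward) shift means the content came from a lower bin
        k = int(np.clip(k - shift_map[k, t], 0, n_bins - 1))
        t -= 1
        path.append((t, k))
    return path[::-1]

# Hypothetical map: everything moves up one bin per frame
shift_map = np.ones((8, 4), dtype=int)
path = track_history(shift_map, k=5, t=3)
```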
III.- MISSING DATA SCENARIO
If parts of the spectrogram are corrupted or missing, as
in figure 5 a), we can "fill in" those values
by considering the corresponding parts of variables X
as hidden and propagating continuous
"belief" messages from and to the hidden nodes, again
using belief propagation. The posteriors
of the continuous hidden nodes are approximated using
Gaussian distributions. The missing values
are then "filled in" with the means of their posterior
distributions. Check the full sequence of
iterations in Video 2.
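The iterative character of the fill-in can be illustrated with a deliberately simplified sketch. It replaces the Gaussian message passing with plain averaging of the two temporal neighbors along the motion, and it assumes a single known integer shift per frame; `fill_in`, `frame_shift` and the toy ridge are hypothetical.

```python
import numpy as np

def fill_in(spec, missing, frame_shift, n_iter=200):
    """Iteratively replace missing bins by the mean of their neighbours along the motion.

    spec: (bins x frames) log-spectrogram; missing: boolean mask of the same shape.
    frame_shift[t]: integer shift (in bins) of frame t relative to frame t-1.
    """
    n_bins, n_frames = spec.shape
    # Start the missing bins at the mean of the observed values
    x = np.where(missing, spec[~missing].mean(), spec).astype(float)
    for _ in range(n_iter):
        new = x.copy()
        for k, t in zip(*np.nonzero(missing)):
            neighbours = []
            if t > 0 and 0 <= k - frame_shift[t] < n_bins:
                neighbours.append(x[k - frame_shift[t], t - 1])
            if t + 1 < n_frames and 0 <= k + frame_shift[t + 1] < n_bins:
                neighbours.append(x[k + frame_shift[t + 1], t + 1])
            if neighbours:
                new[k, t] = np.mean(neighbours)
        x = new
    return x

n_bins, n_frames = 8, 6
spec = np.zeros((n_bins, n_frames))
for t in range(n_frames):
    spec[t + 1, t] = 5.0                 # a ridge rising one bin per frame
missing = np.zeros_like(spec, dtype=bool)
missing[:, 2:4] = True                   # occlude frames 2 and 3
frame_shift = np.ones(n_frames, dtype=int)
filled = fill_in(np.where(missing, 0.0, spec), missing, frame_shift)
```

As in Figure 5, the estimate improves with the number of iterations, because information from the reliable frames needs several passes to reach the middle of the occluded region.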


Figure 5
a) Original occluded signal
b) Filled-in signal after 5 iterations
c) Filled-in signal after 10 iterations
d) Filled-in signal after 20 iterations
e) Filled-in signal after 30 iterations
f) Filled-in signal after 40 iterations
Description of Video 2.
The video presents the missing data application using
a single-layer model. This time four panels are
used. Panel 1 is first used to define the missing
regions; once the "Fill In" button is pressed, the
missing regions are replaced by the "filled in" values.
Panel 2, as before, shows the local
likelihood potentials. When the observation of a particular
time-frequency bin is missing, no reliable
local likelihood potential can be estimated, and therefore
no local "belief" regarding the identity
of the correct transformation can be transmitted to the
corresponding transformation node. We therefore
set the corresponding messages from the local likelihood
potentials to the transformation nodes
to be uniform. The transformation posteriors (Panel 3) are
estimated as before using the "new" local
likelihood potentials, so the transformation posteriors
in the "missing" regions are driven
entirely by the transformation "beliefs" of their reliable
neighbors. We keep the transformation
posteriors fixed and start propagating the continuous
messages, and the missing values begin
to be filled. Once we have some "meaningful" information
in the missing regions, we start to
calculate the local potentials in the missing regions
using the "filled in" values, and the
transformation posteriors are frequently reestimated.
Panel 4 shows the original signal for
comparison purposes only.
VIDEO 2.- Missing Data Application.
Description of Video 3.
This video is similar to the previous one; it also shows
the "fill in" application with a single
layer. The purpose of this video is to illustrate that
belief propagation is not "magic", and
that when a single missing region is too big, the reliable
neighbors are too far apart to propagate
the right transformation "beliefs" to their missing
neighbors. Even so, the perceptual
results obtained are significantly better than the signal
with "missing regions".
VIDEO 3.- Missing Data Application. Severe Case
IV.- TWO LAYER SOURCE-FILTER TRANSFORMATIONS
Many sound sources, including voiced speech, can be successfully
regarded as the convolution of a
broad-band source excitation, such as the pseudo-periodic
glottal flow, and a time-varying
resonant filter, such as the vocal tract,
that "colors" the excitation to produce speech sounds or
other distinctions.
When the excitation has a spectrum consisting of well-defined
harmonics, the overall spectrum is
in essence the resonant frequency response sampled at
the frequencies of the harmonics.
Figure 6 shows a spectrogram where the harmonics and the formants are clearly visible.
Convolution of the source with the filter in the
time domain corresponds to multiplying
their spectra in the Fourier domain, or adding them in the
log-spectral domain. Hence, we model
the log-spectra X as the sum of variables F and H, which
explicitly model the formants and
the harmonics of the speech signal. The source-filter
transformation model is based on two
additive layers of the deformation model described above,
as illustrated in figure 7.
Figure 7
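The additive relation X = F + H in the log-spectral domain can be checked numerically. The sketch below uses a synthetic pulse train and a random decaying filter as hypothetical stand-ins for the glottal source and the vocal tract, and uses circular convolution so that the DFT identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, period = 256, 32
excitation = np.zeros(n)
excitation[::period] = 1.0  # pulse train standing in for the glottal source
# Hypothetical short decaying filter standing in for the vocal tract
filt = rng.standard_normal(16) * np.exp(-np.arange(16) / 4.0)

# Circular convolution in the time domain, computed directly from its definition
signal = np.array([
    sum(filt[m] * excitation[(i - m) % n] for m in range(len(filt)))
    for i in range(n)
])

# The pulse train only has energy at multiples of n / period (its harmonics)
harmonic_bins = np.arange(0, n // 2 + 1, n // period)
log_X = np.log(np.abs(np.fft.rfft(signal))[harmonic_bins])
log_H = np.log(np.abs(np.fft.rfft(excitation))[harmonic_bins])
log_F = np.log(np.abs(np.fft.rfft(filt, n))[harmonic_bins])

additive = np.allclose(log_X, log_H + log_F)
```

The comparison is restricted to the harmonic bins because the excitation carries no energy elsewhere, which mirrors the remark that the overall spectrum is the filter response sampled at the harmonic frequencies.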
Variables F and H in the model are hidden, while, as before,
X can be observed or hidden.
The symmetry between the two layers is broken by using
different parameters in each,
chosen to suit the particular dynamics of each component.
We use transformations with a larger support in the formant
layer compared to the
harmonics layer. Since all harmonics tend to move in
the same direction, we enforce smoother
transformation maps on the harmonics layer by using transition
potential matrices with
higher self-loop probabilities.
Figure 8 shows the decomposition of a speech signal into
harmonic and formant
components, illustrated as the means of the posteriors
of the continuous hidden variables
in each layer.
Figure 8
The decomposition is not perfect. Since we separate the
components in terms of differences
in dynamics, this criterion becomes insufficient when
both layers have similar motion.
Separation improves modeling precisely when each component
has a different motion, and
when the motions coincide it is not really important in
which layer the source is actually captured.
However, some applications may require a better separation,
which is part of our current
research.
In figure 9, b) shows the spectrogram of part a) with a "missing"
region; notice that the two
layers have distinctly different motions. In c)
the region has been filled via inference
in a single-layer model; notice that since the
formant motion does not follow the harmonics,
the formants are not captured in the reconstruction.
In d) the two layers are first decomposed
and then each layer is filled in; the figure shows the
sum of the filled-in versions of the two layers.
Figure 9
Description of Video 4.
This video illustrates the need for a model that takes
into account the production model for
voiced speech. The video shows a single-layer "fill in"
application, where the "missing"
harmonics are correctly regenerated while the model fails to
regenerate some of the "missing"
formants. The panels display the same information as
in the previous two videos.
VIDEO 4.- One layer; Missing Data Application; Introducing the need for two layers.
Description of Video 5.
This video performs harmonics/formants separation. The
screen has 5 panels. Panel 1 shows
the original spectrogram; Panel 2 displays the
estimated means for the harmonics posteriors;
Panel 3 displays the estimated means for the formants
posteriors; Panel 4 displays the
transformation maps for the harmonics layer; Panel 5
displays the transformation maps for the
formants layer.
VIDEO 5.- Harmonics/Formants Separation.
Description of Video 6.
This video performs the two-layer missing data application.
Here each layer is "filled in"
independently. As before, the transformation maps for
each layer are reestimated using uniform
messages from the local likelihood potentials in the
"missing" regions. The complete spectrogram
is "filled in" with the sum of the "filled in"
versions of each layer.
The screen has 5 panels. Panel 1 shows the complete
spectrogram with the "filled in" values;
Panel 2 displays the "filled in" values for the harmonics
layer; Panel 3 displays the "filled in"
values for the formants layer; Panel 4 displays
the transformation maps for the harmonics layer;
Panel 5 displays the transformation maps for the formants
layer.
VIDEO 6.- Two layers Missing Data Application.
V.- MATCHING-TRACKING MODEL
Prediction of frames from their context is not always
possible, for example when there are transitions
between silence and speech or transitions between voiced
and unvoiced speech, so we need a
set of states to represent these unpredictable
frames explicitly.
We will also need a second "switch" variable that will
decide when to "track" (transform) and
when to "match" the observation with a state.
Figure 10 shows a graphical representation of this
model. At each time frame, discrete variables St
and Ct are connected to all frequency bins in that
frame. St indexes the components of a uniformly weighted GMM containing
the means and the variances of the states to
model. Variable Ct takes two values: when
it is equal to 0, the model is in "tracking mode"; a value
of 1 designates "matching mode".
Figure 10
The potentials between variables St, Ct, Xt, Ft and Ht are described
in the paper.
The posteriors for variables St
and Ct, Q(St) and Q(Ct), are obtained
using the
belief propagation rules. Q(Ct = 0) is large
if the match between the current frame
and its prediction from the context is better than the
match between the current
frame and the set of means of the GMM. In early iterations,
when the means are still
quite random, the match between the set of means and
the observations is poor,
making Q(Ct = 0) large, with the result
that the explicit states are never used.
To prevent this, we start the model with large variances,
which result in
non-zero values for Q(Ct = 1), and hence the
explicit states will be learned.
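The effect of the initial variances on Q(Ct = 1) can be illustrated with a small numerical sketch. It assumes isotropic Gaussian likelihoods for both modes and equal priors on tracking and matching, which is a simplification of the potentials described in the paper; all function names and numbers are hypothetical.

```python
import numpy as np

def gaussian_loglik(x, mean, var):
    """Log-density of an isotropic Gaussian evaluated at frame x."""
    d = x.size
    return -0.5 * d * np.log(2 * np.pi * var) - 0.5 * np.sum((x - mean) ** 2) / var

def match_posterior(frame, prediction, state_means, track_var, state_var):
    """Return Q(C_t = 1), assuming equal priors on tracking and matching."""
    log_track = gaussian_loglik(frame, prediction, track_var)
    # Uniformly weighted mixture over the explicit states
    log_states = np.array([gaussian_loglik(frame, m, state_var) for m in state_means])
    log_match = np.logaddexp.reduce(log_states) - np.log(len(state_means))
    return 1.0 / (1.0 + np.exp(log_track - log_match))

frame = np.array([1.0, 2.0, 3.0, 4.0])
prediction = frame + 0.5            # the context predicts the frame fairly well
state_means = [np.zeros(4)]         # an uninformative, still "random" state mean

q_small = match_posterior(frame, prediction, state_means, track_var=1.0, state_var=0.1)
q_large = match_posterior(frame, prediction, state_means, track_var=1.0, state_var=10.0)
```

With a small state variance the poorly placed mean makes Q(Ct = 1) vanish numerically, so the state would receive no data to learn from; a large variance keeps it small but non-zero.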
As we progress, we start to learn the variances. When
the variances are large, most
of the frames are used in learning the means, typically
resulting in a set of similar
"blurry" states. However, as the variances start to be
learned and become smaller,
Q(Ct = 1) takes non-zero values in only a very
few frames, selecting those few frames to
learn the means and resulting in "sharper" means.
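The difference between "blurry" and "sharp" means can be sketched with a posterior-weighted average of the frames. The weighted-mean form below is the usual EM-style update for Gaussian means and is an assumption here, since the exact update rule is given in the paper; `update_means` and the toy data are hypothetical.

```python
import numpy as np

def update_means(frames, q_match, q_state):
    """Posterior-weighted update of the state means.

    frames: (n_frames x n_bins); q_match[t] = Q(C_t = 1);
    q_state[t, s] = Q(S_t = s). Returns (n_states x n_bins) means.
    """
    weights = q_match[:, None] * q_state           # (n_frames x n_states)
    totals = weights.sum(axis=0)[:, None]
    return (weights.T @ frames) / np.maximum(totals, 1e-12)

frames = np.arange(12.0).reshape(4, 3)             # four toy frames, three bins
q_state = np.ones((4, 1))                          # a single explicit state

# Early stage: every frame is partially matched, so the mean is an average
blurry = update_means(frames, np.full(4, 0.5), q_state)
# Late stage: only frame 2 is matched, so the mean is that single frame
sharp = update_means(frames, np.array([0.0, 0.0, 1.0, 0.0]), q_state)
```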
We start with a relatively large number of means, but this
becomes much smaller once
the variances are learned. The resulting states typically
consist of single frames at
discontinuities, as intended. Figure 11 a) shows the frames
chosen for the spectrogram
in figure 1. The signal reconstruction is done
by using the corresponding chosen frames
in the two layers. The reconstruction is simply
another instance of inferring missing
values, except that the motion fields are not reestimated
since we have the true ones.
Figure 11 shows several stages of the reconstruction.
Figure 11
Videos 7 and 10 provide good insight into how this model operates.
Description of Video 7.
In this video, the matching-tracking model is applied
to a single-source signal. The video is
divided in two parts: the estimation of the model and
the reconstruction of the signal spectrogram
from the model parameters. As we mentioned above,
some care with the model variances has to be
taken to ensure adequate estimation of the model
parameters. We keep the variances large for
the early stages of the model estimation, and learn
them in the later stages. We present and
briefly discuss three screen shots for this video.
During model estimation, the demo displays information
in 4 panels. Panel 1 displays the signal to
be modeled. Panel 4 displays the set of means for the states
of variable St; Panel 2 displays the mean
of the most likely state for each frame; Panel 3 shows
the values of the posteriors for Ct = 1,
i.e. the probability that the model matches frame t with
a state from variable St rather than tracking
(estimating) the frame from its context.
In panel 3, frames in dark blue correspond to those frames
where Q(Ct=1) is close to zero, meaning
that the frame is tracked rather than matched to a state.
Frames in dark red correspond to those frames
where Q(Ct=1) is close to one, meaning that
the frame is used to learn the states during the model
estimation. Frames with colors in between are partially
tracked and partially matched to a state, and they
are also partially used to learn the states during the
model estimation. Notice that this is the case during
the early stages of the model estimation procedure (screen
1); this results in a set of "blurry" means (panel 4).
Once we start to learn the variances in the later stages
of the model estimation procedure (screen 2),
the posteriors Q(Ct=1) become very "peaky",
meaning that the frames are either "tracked" or "matched".
Notice that very few frames are matched and represented
in the set of means. In fact, the final means are
composed of single frames, which constitutes a very "sharp"
set of means (panel 4). Also notice that
even though we started with a set of 15 means, only 9
are actually used. When the model estimation
finishes (screen 2), panel 5 shows the chosen frames.
The signal reconstruction is done (screen 3) by using
the corresponding chosen frames in the two layers.
The reconstruction is simply another instance of
inferring missing values, except that the motion fields are not
reestimated since we have the true ones. Panel 2 shows
the reconstruction of the harmonics layer, panel
3 shows the reconstruction of the formants layer, and panel
1 shows the complete reconstruction obtained by
adding the reconstructions of the two layers.
VIDEO 7.- Matching-Tracking Model of a single source signal and signal reconstruction from the model.
VI.- APPLICATIONS.
Formants and Harmonics Tracking.
Analyzing a signal with the two-layer model permits separate
tracking of the harmonics and formants at
any given point in the spectrogram. The user clicks
on the spectrogram to select a bin and the system
reveals the harmonics and formant "history" for that
bin. Figure 12 b) shows an example of harmonics
tracking for the bin chosen in part a); c) shows an example
of formant tracking for the bin chosen in
part a).
a) Chosen Bin
b) Harmonics tracking
c) Formants tracking.
Figure 12
Watch another example in the following video.
Description of Video 8.
The video displays information in five panels. Panel
1 displays the spectrogram of the signal; Panels
2 and 3 show the harmonics and formants layers; Panels
4 and 5 display the harmonics and formants
transformation maps. The user can "click" anywhere on
the spectrogram and then track the
harmonics or formants "history" by selecting the harmonics
tracking button or the formants tracking
button. The tracking results are displayed on panel 1
and in the corresponding panel for the component
being tracked.
VIDEO 8.- Formants and Harmonics Tracking.
Two Speakers Semi-Supervised Source Separation.
After modeling the input signal, the user "clicks" on
those regions of the spectrogram that they believe
belong to the speaker to be eliminated.
The system then masks all neighboring
time-frequency bins with the same prominent transformation
as that of the selected bin; the bins
remaining in the spectrogram belong to the other speaker.
Figure 13 depicts an instance of this
application. Part a) shows the user "clicks" on the spectrogram;
b) shows where the "clicks" map
into the harmonics transformation map; part c) shows
the resulting mask.

a)
b)
c)
Figure 13
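One way to read the masking step is as a flood fill on the transformation map starting from the clicked bin. The sketch below assumes 4-connected neighbors and a map holding one dominant transformation index per bin; both are interpretations rather than details confirmed by the demo, and `transformation_mask` is a hypothetical name.

```python
from collections import deque
import numpy as np

def transformation_mask(trans_map, seed):
    """Boolean mask of the connected region sharing the seed bin's transformation."""
    n_bins, n_frames = trans_map.shape
    target = trans_map[seed]
    mask = np.zeros(trans_map.shape, dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        k, t = queue.popleft()
        for dk, dt in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nk, nt = k + dk, t + dt
            if (0 <= nk < n_bins and 0 <= nt < n_frames
                    and not mask[nk, nt] and trans_map[nk, nt] == target):
                mask[nk, nt] = True
                queue.append((nk, nt))
    return mask

# Toy map: one speaker moving up (1), the other moving down (-1)
trans_map = np.array([[1, 1, -1, -1],
                      [1, 1, -1, -1],
                      [0, 1, 1, -1]])
mask = transformation_mask(trans_map, seed=(0, 0))
```

Applying `~mask` to the spectrogram would then keep only the bins attributed to the other speaker.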
Description of Video 9.
This video shows how the demo generates the example
in figure 13.
VIDEO 9.- Two Speakers Semi-Supervised Source Separation.
Missing data interpolation and Harmonics/Formants Separation.
Check videos 2, 3, 5 and 6 for examples of these applications.
Features for Speech Recognition.
The phonetic distinctions at the basis of speech recognition
reflect vocal tract filtering of glottal
excitation. In particular, the dynamics of formants (vocal
tract resonances) are known to be
powerful "information-bearing elements" in speech. We
believe the formant transformation
maps may be a robust discriminative feature to be used
in conjunction with traditional features in
speech recognition systems, particularly in noisy conditions.
The belief propagation
algorithm enforces a dynamic structure on the transformation
maps, and since the speech signal is often
more structured than the noise, the
belief propagation algorithm finds
transformation maps that are consistent with the dynamics
of the speech rather than those of
the noise. An example of the robustness of the formant
transformation map to noise can be
observed in figure 14, where the formant transformation
maps for clean and noisy versions
of the same signal are shown.
VII.- POTENTIAL UNSUPERVISED SOURCE SEPARATION APPLICATION
The right-hand side of figure 15 illustrates the entropy
of the distribution inferred by the system
for each transformation variable on a composed signal.
The third pane of the figure shows
"entropy edges", boundaries of high transformation uncertainty.
With some exceptions, these
boundaries correspond to transitions between silence
and speech, or to points where occlusion between
speakers starts or ends. Similar edges are also found
at the transitions between voiced and
unvoiced speech. High entropy at these points indicates
that the model does not know what to
track and cannot find a good transformation to predict
the following frames. These "transition"
points are captured by the state variables: when composed
signals are modeled using the
matching-tracking model, the state nodes normally capture
the first frame of the "new
dominant" speaker. The third pane of the figure also
shows the frames chosen as states by
the system.
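The entropy map discussed above can be computed directly from the transformation posteriors. This is a minimal sketch assuming the posteriors are stored with the transformation index on the last axis; `transformation_entropy` is a hypothetical helper.

```python
import numpy as np

def transformation_entropy(posteriors, eps=1e-12):
    """Entropy (in nats) over the last axis of an array of posteriors."""
    p = np.clip(posteriors, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

posteriors = np.array([[1.0, 0.0, 0.0],         # confident: one clear transformation
                       [1 / 3, 1 / 3, 1 / 3]])  # uncertain: nothing to track
ent = transformation_entropy(posteriors)
```

Bins with entropy close to the logarithm of the number of transformations are the candidates for "entropy edges".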
The source separation problem can be addressed as follows.
When multiple speakers are
present, each speaker will be modeled in its own layer,
further divided into harmonics
and formants layers. The idea is to reduce the transformation
uncertainty at the onset of
occlusions by continuing the tracking of the "old"
speaker in one layer at the same time
as estimating the initial state of the "new" speaker
in another layer, a realization of the
"old-plus-new" heuristic from psychoacoustics. This is
part of our current research.
Description of Video 10.
In this video, the matching-tracking model is applied
to the composed signal from figure 15.
The demo displays information in 4 panels. Panel 1 displays
the signal to
be modeled. Panel 4 displays the set of means for the states
of variable St; Panel 2 displays the mean
of the most likely state for each frame; Panel 3 shows
the values of the posteriors for Ct = 1,
i.e. the probability that the model matches frame t with
a state from variable St rather than tracking
(estimating) the frame from its context.
The video screen shot shows the chosen frames once the
estimation of the model parameters
is done. We edited the screen shots with black lines to
better identify the chosen frames in the
composed signal, showing that the chosen frames have
the previously mentioned characteristics.
VIDEO 10.- Matching-Tracking Model on composed signals.