"Deformable Spectrograms"
Manuel Reyes-Gomez    Nebojsa Jojic    Daniel P.W. Ellis
Columbia University   Microsoft Research   Columbia University
 

VII.-  DOMINANT SPEAKER UNSUPERVISED SOURCE SEPARATION.
Separating a speech mixture into its individual sources using a single microphone
is a hard and interesting problem. Current approaches include attempts to segregate
a time-frequency representation on a bin-by-bin basis:
each bin is analyzed and tagged as belonging to one of the individual sources.
The large combinatorial space created by analysis at such fine resolution
poses a great challenge to systems attempting this kind of separation.
On the other hand, other research has shown that an intelligible separation can be obtained by
grouping those regions of the spectrogram where a given speaker is more dominant than the others.
The problem is how to find those speaker-dominant regions. We address this problem using a
subband version of our Matching-Tracking Model. We first motivate the need for such a model.

Matching-Tracking Model on Composed Signals.
VIDEO 10
The right-hand side of Figure 15 illustrates the entropy of the distribution inferred by the system
for each transformation variable on a composed signal. The third pane of the figure shows
"entropy edges", boundaries of high transformation uncertainty. With some exceptions, these
boundaries correspond to transitions between silence and speech, or to the points where occlusion
between speakers starts or ends. Similar edges are also found at the transitions between voiced and
unvoiced speech. High entropy at these points indicates that the model does not know what to
track and cannot find a good transformation to predict the following frames. These "transition"
points are captured by the state variables: when composed signals are modeled using the
matching-tracking model, the state nodes normally capture the first frame of the "new
dominant" speaker. The third pane of the figure also shows the frames chosen as states by
the system.
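
As a rough illustration of how such entropy edges could be located, the Python sketch below computes the Shannon entropy of a posterior over candidate transformations and flags the positions where it exceeds a threshold. It is not the system's inference code. The toy posteriors, the simplification to one distribution per frame, the function names and the threshold value are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_edges(posteriors, threshold):
    """Indices whose transformation posterior has entropy above `threshold`.

    posteriors: list of distributions over candidate transformations,
    here simplified to one distribution per frame (an assumption).
    """
    return [t for t, q in enumerate(posteriors) if entropy(q) > threshold]

# Toy posteriors over 4 hypothetical transformations for 5 frames.
posteriors = [
    [0.97, 0.01, 0.01, 0.01],  # confident: one transformation explains the frame
    [0.94, 0.02, 0.02, 0.02],
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: candidate entropy edge
    [0.01, 0.97, 0.01, 0.01],
    [0.02, 0.94, 0.02, 0.02],
]

# The threshold of 1.0 nat is an arbitrary illustrative choice.
edges = entropy_edges(posteriors, threshold=1.0)
print(edges)
```

A uniform posterior over K transformations has the maximum entropy log K, so a threshold somewhat below that value singles out the frames where no transformation predicts the data well.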


                                                                Figure 15

Description of Video 10.
In this video, the matching-tracking model is applied to the composed signal from Figure 15.
The demo displays information in 4 panels. Panel 1 displays the signal to
be modeled. Panel 4 displays the set of means for the states of variable St. Panel 2 displays the mean
of the most likely state for each frame. Panel 3 shows the values of the posteriors for Ct = 1,
i.e. the probability that the model matches frame t with a state from variable St, rather than tracking
(estimating) the frame from its context.

The video screenshot shows the chosen frames once the estimation of the model parameters
is done. We edited the screenshots with black lines to better identify the chosen frames in the
composed signal; the chosen frames have the previously mentioned characteristics.
 

VIDEO 10.- Matching-Tracking Model on composed signals.


The next figure shows the selected frames for another composed signal:

Notice that, for both composed signals, even though the model does find the frames where a
``new source enters'' the scene or an ``old one leaves'' it, in general the segmentation
does not produce regions belonging to a single source. This is because the magnitude of the
interference is not uniform across the spectrum. Therefore we require a model that can
``track'' in some sections of the spectrum while ``matching'' in others.

- Subband Matching-Tracking Model for Composed Signals.

Our goal is to find regions where a single source dominates the mixture by finding switches of
the dominant source. We therefore extend our matching-tracking model, conceived as a single-source
model, to a subband version that accommodates the modelling of signals with multiple
sources.

The next figure shows the graphical representation of the subband version.

The tracking part of the model is done as in the full-spectrum version; the matching part is
divided into R subbands. Each subband has its own "state" and "switch" variables.
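
To make the subband division concrete, the following Python sketch splits one spectral frame into R contiguous subbands, each of which would then be matched against its own states. The equal-width split and the function name are assumptions made here; the band layout actually used by the model is not specified in this section.

```python
def split_subbands(frame, R):
    """Split a list of frequency-bin values into R contiguous subbands.

    Bands are made as equal in width as possible (an assumption; the
    layout used by the model may differ).
    """
    n = len(frame)
    bounds = [round(r * n / R) for r in range(R + 1)]
    return [frame[bounds[r]:bounds[r + 1]] for r in range(R)]

# Toy frame with 10 frequency bins, split into R = 3 subbands.
frame = list(range(10))
bands = split_subbands(frame, 3)
print(bands)
```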
 

- Dominant Speaker Segmentation Results

The next figure shows the subband frames selected from this version of the model for the
previous composed signal.

                   Example of a signal segmented into dominant speaker regions
 

The model detects the changes of dominant speaker as well as the transitions between
speech and silence and between voiced and unvoiced speech. There are also a few false positives.
The false positives correspond to mismatches within the same speaker, such as when there
are abrupt variations in the motion of both layers.

We ran experiments on 200 artificial mixtures of two speakers:
50 female-female, 50 male-male, 50 male-female and 50 same-speaker mixtures of different
utterances.

Since we are artificially mixing the signals, we can find the true dominant
speaker boundaries (see the paper for details). We defined three types of regions:
R1, regions dominated by speaker 1 by more than 3 dB; R2, the corresponding
regions for speaker 2; and R0, regions that neither speaker dominates.

We then define two types of dominant speaker boundaries: hard boundaries, which correspond
to the boundaries between regions R1 and R2, and soft boundaries, which correspond to
R0 regions found between R1 and R2 regions. We also detect SIL regions, where both
speakers have low energy.
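
The region and boundary definitions can be sketched in Python as follows. This is only an illustration under stated assumptions: the energies are toy per-frame values in dB for a single subband, the silence threshold of -40 dB is invented, and a soft boundary is taken to be an R0 run with R1 on one side and R2 on the other. The exact procedure is described in the paper.

```python
def label_region(e1_db, e2_db, margin_db=3.0, silence_db=-40.0):
    """Label one frame from the energies (in dB) of the two clean sources."""
    if e1_db < silence_db and e2_db < silence_db:
        return "SIL"   # both speakers have low energy (threshold is assumed)
    if e1_db - e2_db > margin_db:
        return "R1"    # speaker 1 dominates by more than 3 dB
    if e2_db - e1_db > margin_db:
        return "R2"    # speaker 2 dominates by more than 3 dB
    return "R0"        # neither speaker dominates

def find_boundaries(labels):
    """Return (hard, soft) boundaries.

    hard: (t, t + 1) frame pairs where R1 and R2 are adjacent.
    soft: (start, end) inclusive R0 runs lying between an R1 and an R2 frame.
    """
    hard, soft = [], []
    n = len(labels)
    t = 0
    while t < n - 1:
        a, b = labels[t], labels[t + 1]
        if {a, b} == {"R1", "R2"}:
            hard.append((t, t + 1))
            t += 1
        elif a in ("R1", "R2") and b == "R0":
            end = t + 1
            while end < n and labels[end] == "R0":
                end += 1                  # end is the first frame after the R0 run
            if end < n and {a, labels[end]} == {"R1", "R2"}:
                soft.append((t + 1, end - 1))
            t = end - 1
        else:
            t += 1
    return hard, soft

# Toy per-frame energies (dB) of the two clean sources.
e1 = [-10, -10, -20, -25, -60, -12, -30]
e2 = [-30, -30, -21, -10, -70, -30, -10]
labels = [label_region(a, b) for a, b in zip(e1, e2)]
hard, soft = find_boundaries(labels)
print(labels, hard, soft)
```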

The following figure shows the R0,R1,R2 and SIL regions for the above composed signal.

Brown corresponds to the R1 regions, Orange corresponds to the R2 regions, Dark Blue
corresponds to the R0 regions and Light Blue corresponds to the SIL regions.

We require our model to detect a switch in either of the two frames bordering a hard edge
and to detect a switch anywhere within the region defined by a soft edge.
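
One way this criterion could be turned into recall and precision figures is sketched below in Python. The scoring rules are an assumption made for illustration: recall is taken as the fraction of reference boundaries with at least one detected switch inside their tolerance window, and precision as the fraction of detected switches that fall inside some window. The function name and the toy inputs are hypothetical, and the paper should be consulted for the exact scoring.

```python
def score_switches(detected, hard, soft):
    """Return (recall, precision) for detected switch frames.

    detected: frame indices where the model reports a switch.
    hard:     (left, right) frame pairs bordering each hard edge.
    soft:     (start, end) inclusive frame ranges of each soft edge.
    """
    windows = [set(pair) for pair in hard]
    windows += [set(range(start, end + 1)) for start, end in soft]
    found = set(detected)
    hits = sum(1 for w in windows if w & found)
    correct = sum(1 for d in detected if any(d in w for w in windows))
    recall = hits / len(windows) if windows else 1.0
    precision = correct / len(detected) if detected else 1.0
    return recall, precision

# Toy example: one hard edge, one soft edge, three detected switches.
recall, precision = score_switches(detected=[3, 6, 9], hard=[(5, 6)], soft=[(2, 3)])
print(recall, precision)
```

In this toy case both reference boundaries are found, while the switch at frame 9 matches no boundary and counts as a false positive.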

The segmentation results on the 200 artificially mixed signals using the subband
deformable spectrograms segmentation are shown in the following table.
 
 

Type of Mixture   Female-Female   Male-Male   Female-Male   Same Speaker
Recall            96.64%          97.94%      97.51%        96.88%
Precision         62.80%          62.37%      61.14%        69.18%
            Dominant Speaker Segmentation Results Using Deformable Spectrograms
 

The recall values are high, without substantial differences between the different kinds
of mixtures. The model does well regardless of the nature of the speakers because it
discovers interruptions in the energy pattern of the signal without relying on any
source-dependent features. The precision results are lower (61% to 69%). This is because transitions
between voiced and unvoiced speech for the same speaker are also detected, as are
mismatches within the same speaker, such as when there are abrupt variations in the
motion of both layers.

For comparison purposes we implemented a pitch-based Bayesian information criterion (BIC)
segmentation (see the paper and its references for details).
The results obtained are the following:
 

Type of Mixture   Female-Female   Male-Male   Female-Male   Same Speaker
Recall            68.47%          66.19%      71.46%        61.49%
Precision         39.94%          38.92%      42.04%        36.55%
            Dominant Speaker Segmentation Results Using a Pitch Based Segmentation Scheme

Since the deformable spectrograms based segmentation has high recall, we can be fairly
confident that the signal is segmented into dominant speaker regions. Even with a few false positives,
clustering these regions is a considerably simpler task than clustering individual bins.

- Spectral Clustering of Dominant Speaker Regions with Examples.
 We first cluster regions within the same subband and later cluster regions across bands.
The entry of the affinity matrix A for regions i and j is defined as:

Aij = exp(-|Dij|^2 / (2 s^2))  for i != j, and Aij = 0 for i = j.

Dij is the average of the n smallest distances between time-frequency patches taken from regions i and j,
and s sets the scale at which distances count as similar.
When clustering within subbands we used n=3; when clustering between
bands we used n=10. This similarity matrix does not depend on pitch, so even regions with
similar pitch can be assigned to different clusters if they show other sources of dissimilarity, such as prosody or style.
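
A minimal Python sketch of this affinity computation follows. It assumes that each region is given as a list of flattened time-frequency patches and that patches are compared with the Euclidean distance; the patch distance, the value of s, the function names and the toy data are illustrative choices, not taken from our implementation.

```python
import math
from itertools import product

def region_distance(patches_i, patches_j, n):
    """Average of the n smallest patch-to-patch distances between two regions."""
    dists = sorted(math.dist(p, q) for p, q in product(patches_i, patches_j))
    k = min(n, len(dists))          # guard for regions with few patches
    return sum(dists[:k]) / k

def affinity_matrix(regions, n, s):
    """A[i][j] = exp(-D_ij^2 / (2 s^2)) for i != j, and 0 on the diagonal."""
    m = len(regions)
    A = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = region_distance(regions[i], regions[j], n)
            A[i][j] = A[j][i] = math.exp(-d * d / (2.0 * s * s))
    return A

# Toy regions: each is a list of 2-dimensional "patches" (hypothetical data).
regions = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(0.0, 0.1), (0.0, 1.1)],   # close to region 0
    [(5.0, 5.0), (5.0, 6.0)],   # far from both
]
A = affinity_matrix(regions, n=2, s=1.0)
for row in A:
    print(["%.3f" % a for a in row])
```

Regions 0 and 1 receive an affinity close to 1, while region 2 has an affinity near 0 with both. A matrix of this form can then be passed to a standard spectral clustering routine.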

EXAMPLE: (Click on figure to listen to the signals)

This composed signal:

 is segmented in dominant speaker regions as above.

The regions are then clustered into three groups, one for each speaker plus silence.

The resulting clusters for each speaker are the following:

                             Cluster for Speaker 1


                               Cluster for Speaker 2

- Clustering with a Speech Recognizer

When the different sound sources have distinctive, low-level properties, it can be relatively
straightforward to identify the correct grouping of regions.  If, however, these gross
differences are not available -- for instance, if two relatively similar voices are interfering --
a more complex set of constraints needs to be employed.  As an extreme example, if the
different groupings of cells are each resynthesized as voices, it may be that certain groupings give rise
to clearly intelligible speech, whereas incorrect groupings that mix up energy from multiple
sources resynthesize to gibberish.  Although this seems like a sophisticated judgement, we
can in fact use the relatively strong model of likely speech signals implicit within a traditional speech
recognition system to distinguish these cases.  This is part of the idea behind the `speech fragment
decoder' [Barker et al.], which aims to recognize speech that has had portions of its time-frequency
surface corrupted by interference.  The speech fragment decoder uses missing-data recognition --
integration of likelihood values over the possible ranges of unknown or distorted dimensions -- to do a
joint search for both the most likely utterance (the conventional speech recognition problem) and the
most likely `missing data mask'.  These likelihoods are easily defined in terms of the distribution
models (probability of observations given the underlying state) at the heart of speech recognition, but
comparing all possible missing-data masks can quickly become intractable.  If, however, the set of
alternative data masks is drastically cut down by dividing the time-frequency plane into large regions and
requiring that all cells in a given region receive the same label, recognition again becomes
feasible.  This is part of our current research.
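
To give a sense of the reduction, the Python sketch below counts the candidate two-label masks when every cell is labeled independently and when whole regions share one label. The grid size and the number of regions are made-up numbers chosen for illustration, not values from the experiments.

```python
import math

def log10_num_masks(num_units, num_labels=2):
    """log10 of the number of ways to assign num_labels labels to num_units units."""
    return num_units * math.log10(num_labels)

# Hypothetical sizes, for illustration only.
cells = 100 * 64   # 100 frames x 64 frequency channels, labeled cell by cell
regions = 40       # the same surface divided into 40 regions

print(f"cell-level masks:   about 10^{log10_num_masks(cells):.0f}")
print(f"region-level masks: about 10^{log10_num_masks(regions):.0f}")
```

With these numbers the search space drops from roughly 10^1927 masks to roughly 10^12.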

- Interpolation of Masked Regions with Examples.
Once we have clustered the segments, we can use the model to infer the masked sections.

Here we keep the transformation maps of both layers for the regions that the desired speaker dominates,
while relearning the transformation maps for the regions that were masked by the other speaker. The reconstruction here is not as unconstrained as in the missing-information examples shown before, since we
can observe the mixed signal in those regions, which constrains what the data can be.
Moreover, restrictions on the structure that the reconstructed signal may take have to be enforced to
prevent the reconstruction from following the structure of the competing speaker.

The following figures show the sequence of signals from the original composed signal to the individual
speaker signals with the estimation of their masked parts.

            Original Composed Signal                   Dominant region segmentation

               Cluster for Speaker 1                              Cluster for Speaker 2

        Speaker 1 with inferred masked regions       Speaker 2 with inferred masked regions

                      Original Speaker 1                                  Original Speaker 2