VII.- DOMINANT SPEAKER UNSUPERVISED
SOURCE SEPARATION.
Separating a speech mixture into its individual
sources using a single microphone
is a very hard and interesting problem. Current approaches
include attempts to segregate
a time-frequency representation on a bin-by-bin basis.
Each bin is subjected to analysis and tagged as belonging
to one of the individual sources.
The large combinatorial space created by the analysis
at such fine resolution
poses a great challenge to systems attempting to do such
a separation.
On the other hand, other research has shown that an intelligible
separation can be obtained by
grouping those regions of the spectrogram where a given
speaker is more dominant than the others.
The problem is how to find those speaker-dominant regions.
We resolve this problem using a
subband version of our Matching-Tracking Model.
We first motivate the need for such a model.
- Matching-Tracking Model on Composed Signals.
The right-hand side of Figure 15 illustrates
the entropy of the distribution inferred by the system
for each transformation variable on
a composed signal. The third pane of the figure shows
"entropy edges", boundaries of high
transformation uncertainty. With some exceptions, these
boundaries correspond to transitions
between silence and speech, or to the points where occlusion between
speakers starts or ends. Similar edges
are also found at the transitions between voiced and
unvoiced speech. High entropy at these
points indicates that the model does not know what to
track and cannot find a good transformation
to predict the following frames. These "transition"
points are captured by the state variables:
when composed signals are modeled using the
matching-tracking model, the state nodes normally capture
the first frame of the "new dominant" speaker.
The third pane of the figure also shows the frames chosen as states by
the system.
Figure 15
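The "entropy edges" described above can be illustrated with a minimal sketch. It assumes the inferred distribution is available as an array of posteriors over K discrete transformations for every frequency band and frame; the array layout, the function names and the threshold `frac` are illustrative assumptions, not part of the original system.

```python
import numpy as np

def transformation_entropy(posterior, eps=1e-12):
    """Shannon entropy (in nats) along the last axis.

    posterior: array of shape (..., K); each slice along the last
    axis is a distribution over K candidate transformations.
    """
    p = np.clip(posterior, eps, 1.0)  # avoid log(0)
    return -np.sum(posterior * np.log(p), axis=-1)

def entropy_edges(posterior, frac=0.9):
    """Indices of frames whose mean entropy is close to the maximum.

    posterior: array of shape (bands, frames, K).
    frac: hypothetical threshold, as a fraction of log(K).
    """
    h = transformation_entropy(posterior)   # shape (bands, frames)
    per_frame = h.mean(axis=0)              # average over bands
    k = posterior.shape[-1]
    return np.flatnonzero(per_frame > frac * np.log(k))

# Toy example: 2 bands, 3 frames, 4 candidate transformations.
# Frames 0 and 2 are confident; frame 1 is completely uncertain.
one_hot = np.array([1.0, 0.0, 0.0, 0.0])
uniform = np.full(4, 0.25)
post = np.stack([np.stack([one_hot, uniform, one_hot])] * 2)
edges = entropy_edges(post)
```

In this toy input only the middle frame is flagged, which mirrors the behavior described in the text: uncertainty peaks where the model cannot find a transformation that predicts the next frame.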
Description of Video 10.
In this video, the matching-tracking
model is applied to the composed signal from Figure 15.
The demo displays information in 4 panels.
Panel 1 displays the signal to be modeled.
Panel 2 displays the mean of the most likely state for each frame.
Panel 3 shows the values of the posteriors for Ct = 1,
i.e. the probability that the model
matches frame t with a state from variable St rather than tracking
(estimating) the frame from its context.
Panel 4 displays the set of means for the states of variable St.
The video screenshot shows the chosen
frames once the estimation of the model parameters
is done. We edited the screenshots with
black lines to better identify the chosen frames in the
composed signal, showing that the chosen
frames have the previously mentioned characteristics.
VIDEO 10.- Matching-Tracking Model on composed signals.
The next figure shows the selected frames for another
composed signal:
Notice that, for both composed signals, even though the
model does find the frames where a
``new source enters'' the scene or an ``old
one leaves'' it, in general the segmentation
does not produce regions belonging to a single source.
This is because the magnitude of the
interference is not uniform across the spectrum.
We therefore require a model that can
``track'' in some sections of the spectrum while ``matching''
in others.
- Subband Matching-Tracking Model for Composed Signals.
Our goal is to find regions where a single source dominates
the mixture by finding switches of
the dominant source. We therefore extend our matching-tracking
model, conceived as a single-source model,
to a subband version that accommodates the modeling
of signals with multiple
sources.
The next figure shows the graphical representation
of the subband version.
The tracking part of the model is done as in the
full-spectrum version; the matching part is
divided into R subbands. Each subband has its own "state"
and "switch" variables.
- Dominant Speaker Segmentation Results
The next figure shows the subband frames selected by
this version of the model for the
previous composed signal.
Example of a signal segmented into dominant speaker regions
The model detects the changes of dominant speaker as well
as the transitions between
speech and silence and between voiced and unvoiced speech.
There are also a few false positives.
The false positives correspond to mismatches within the
same speaker, such as when there
are abrupt variations in the motion of both layers.
We ran experiments on 200 artificial mixtures
of two speakers:
50 female-female, 50 male-male, 50 male-female and 50
same-speaker mixtures with different
utterances.
Since we mix the signals artificially, we have access to the
individual sources and can find the true dominant
speaker boundaries. (See paper for details).
We defined three types of regions:
R1, regions dominated by speaker 1 with a dominance of over
3 dB; R2, the corresponding
regions for speaker 2; and R0, the regions that neither
of the speakers dominates.
We then define two types of dominant speaker boundaries:
hard boundaries, which are
the boundaries between adjacent R1 and R2 regions, and soft
boundaries, which correspond to
R0 regions found between R1 and R2 regions. We
also detect SIL regions, where both
speakers have low energy.
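As a minimal sketch of the region definitions above, the labels can be computed from the energies of the two unmixed sources. Only the 3 dB dominance margin comes from the text; the use of power in dB, the silence threshold `silence_db` (relative to the loudest unit) and the function name are assumptions made for illustration. The paper should be consulted for the actual procedure.

```python
import numpy as np

def label_regions(e1, e2, dominance_db=3.0, silence_db=-40.0):
    """Label each unit (frame or time-frequency cell) as R1, R2, R0 or SIL.

    e1, e2: energies of speaker 1 and speaker 2, same shape.
    dominance_db: margin by which one speaker must exceed the other.
    silence_db: hypothetical threshold, relative to the loudest unit,
                below which both speakers are considered silent.
    """
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    eps = 1e-12                                  # avoid log10(0)
    l1 = 10.0 * np.log10(e1 + eps)
    l2 = 10.0 * np.log10(e2 + eps)

    labels = np.full(e1.shape, "R0", dtype="<U3")  # default: nobody dominates
    labels[l1 - l2 > dominance_db] = "R1"
    labels[l2 - l1 > dominance_db] = "R2"

    ref = max(l1.max(), l2.max())
    silent = (l1 < ref + silence_db) & (l2 < ref + silence_db)
    labels[silent] = "SIL"                         # silence overrides the rest
    return labels

# Toy example with four units.
example_labels = label_regions([1.0, 1.0, 0.01, 1e-7],
                               [0.1, 0.8, 1.0, 1e-7])
```

For the toy input, the four units are labeled R1, R0, R2 and SIL respectively: the second unit has a level difference of under 1 dB, so neither speaker dominates it.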
The following figure shows the R0, R1, R2 and SIL regions
for the above composed signal.
Brown corresponds to the R1 regions, orange
to the R2 regions, dark blue
to the R0 regions and light blue
to the SIL regions.
We require our model to detect a switch in either of the
two frames bordering the hard edges
and to detect a switch anywhere within the regions defined
by the soft edges.
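The detection criterion above can be sketched as a small scoring routine. The text only states where a switch must be detected; the exact definitions of recall and precision used here (a boundary is recalled if at least one detection falls inside its interval, and a detection is correct if it falls inside some interval) are assumptions, as are the function names and the handling of SIL regions.

```python
def boundary_intervals(labels):
    """Frame intervals (inclusive) in which a switch must be detected.

    Hard boundary between frames t and t+1: interval (t, t+1).
    Soft boundary: the whole R0 run lying between an R1 and an R2 region.
    """
    intervals = []
    n = len(labels)
    for t in range(n - 1):
        a, b = labels[t], labels[t + 1]
        if {a, b} == {"R1", "R2"}:
            intervals.append((t, t + 1))
        elif a in ("R1", "R2") and b == "R0":
            end = t + 1
            while end + 1 < n and labels[end + 1] == "R0":
                end += 1
            nxt = labels[end + 1] if end + 1 < n else None
            if nxt in ("R1", "R2") and nxt != a:
                intervals.append((t + 1, end))
    return intervals

def score_switches(detected, intervals):
    """Return (recall, precision) of detected switch frames."""
    detected = sorted(set(detected))
    hit = [any(a <= d <= b for d in detected) for a, b in intervals]
    ok = [any(a <= d <= b for a, b in intervals) for d in detected]
    recall = sum(hit) / len(intervals) if intervals else 1.0
    precision = sum(ok) / len(detected) if detected else 1.0
    return recall, precision

# Toy example: one hard boundary (frames 1-2) and one soft boundary (frames 4-5).
toy_labels = ["R1", "R1", "R2", "R2", "R0", "R0", "R1", "SIL", "R1"]
toy_intervals = boundary_intervals(toy_labels)
toy_recall, toy_precision = score_switches([2, 5, 7], toy_intervals)
```

In the toy example both boundaries are found, so recall is 1.0, while the detection at frame 7 (a speech-to-silence transition) counts as a false positive and lowers precision to 2/3. This is the same kind of error the text reports for the real model.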
The segmentation results on the 200 artificially
mixed signals using the subband
deformable spectrograms segmentation are shown
in the following table.
| Type of Mixture | Female-Female | Male-Male | Female-Male | Same Speaker |
| Recall | 96.64% | 97.94% | 97.51% | 96.88% |
| Precision | 62.80% | 62.37% | 61.14% | 69.18% |
Dominant Speaker Segmentation Results Using Deformable Spectrograms
The recall values are high, without substantial differences
between the different kinds
of mixtures. The model does well regardless of the nature
of the speakers because it
discovers interruptions in the energy pattern of the
signal without relying on any
source-dependent features. The precision results are not as
good. This is because transitions
between voiced and unvoiced data for the same speaker
are also detected, as well as
mismatches within the same speaker, such as when there are
abrupt variations in the
motion of both layers.
For comparison purposes we implemented a pitch-based Bayesian
information criterion (BIC)
segmentation. (Check the paper and some of its references
for details.)
The results obtained are the following:
| Type of Mixture | Female-Female | Male-Male | Female-Male | Same Speaker |
| Recall | 68.47% | 66.19% | 71.46% | 61.49% |
| Precision | 39.94% | 38.92% | 42.04% | 36.55% |
Dominant Speaker Segmentation Results Using a Pitch Based Segmentation Scheme
Since the deformable-spectrogram-based segmentation has
high recall values, we can be fairly
confident that the signal is segmented into dominant speaker
regions. Even with a few false positives,
clustering these regions is a considerably simpler task
than clustering individual bins.
- Spectral Clustering of Dominant Speaker Regions with Examples.
We first cluster regions within the same
subband and later we cluster regions between bands.
The entry of the affinity matrix A for regions i and j
is defined as:
$A_{ij} = \exp\left(-|D_{ij}|^2 / 2s^2\right)$ for $i \neq j$, and $A_{ij} = 0$ for $i = j$.
$D_{ij}$ is the average of the n smallest distances between
time-frequency patches taken from regions i and j
(the sum of the n minimum distances, divided by n). When clustering
within subbands we used n=3; when clustering between
bands we used n=10. This similarity matrix does
not depend on pitch, so even regions with
similar pitch can be clustered correctly if they show other sources
of dissimilarity, such as prosody or style.
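The affinity definition above can be sketched as follows. The sketch assumes each region is represented by a set of equally sized time-frequency patches flattened into vectors, and that the distance between two patches is Euclidean; the text does not specify the distance, so this choice, the function names and the default scale `s` are illustrative assumptions.

```python
import numpy as np

def region_distance(patches_i, patches_j, n):
    """Mean of the n smallest pairwise distances between two patch sets.

    patches_i: array of shape (m_i, d); patches_j: array of shape (m_j, d).
    Euclidean distance is an assumption made for this sketch.
    """
    diff = patches_i[:, None, :] - patches_j[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1)).ravel()
    n = min(n, dists.size)              # guard against very small regions
    return np.sort(dists)[:n].mean()

def affinity_matrix(regions, n=3, s=1.0):
    """A[i, j] = exp(-D_ij^2 / (2 s^2)) for i != j, and 0 on the diagonal."""
    k = len(regions)
    A = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d_ij = region_distance(regions[i], regions[j], n)
            A[i, j] = A[j, i] = np.exp(-d_ij ** 2 / (2.0 * s ** 2))
    return A

# Toy example: regions 0 and 1 are similar, region 2 is far from both.
toy_regions = [np.zeros((4, 2)),
               np.full((4, 2), 0.1),
               np.full((4, 2), 5.0)]
toy_A = affinity_matrix(toy_regions, n=3, s=1.0)
```

The resulting symmetric matrix with a zero diagonal is the usual input to a spectral clustering step, which then groups regions according to the leading eigenvectors of a normalized version of A.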
EXAMPLE:
This composed signal:
is segmented into dominant speaker regions as above.
The regions are then grouped into three clusters, one for
each speaker plus silence.
The resulting clusters for each speaker are the following:
Cluster for Speaker 1
Cluster for Speaker 2
- Clustering with a Speech Recognizer
When the different sound sources have distinctive, low-level
properties, it can be relatively
straightforward to identify the correct grouping of
regions. If, however, these gross
differences are not available (for instance, if two
relatively similar voices are interfering),
a more complex set of constraints needs to be employed.
As an extreme example, if the
different groupings of cells lead to reconstructed voices,
it may be that certain groupings give rise
to clearly intelligible speech, whereas incorrect groupings
that mix up energy from multiple
sources resynthesize to gibberish. Although
this seems like a sophisticated judgement, we
can in fact use the relatively strong model of likely
speech signals implicit within a traditional speech
recognition system to distinguish these cases.
This is part of the idea behind the `speech fragment
decoder' [Barker et al.], which aims to recognize speech
that has had portions of its time-frequency
surface corrupted by interference. The speech fragment
decoder uses missing-data recognition
(integration of likelihood values over the possible ranges
of unknown or distorted dimensions) to do a
joint search for both the most likely utterance (the
conventional speech recognition problem) and the
most likely `missing data mask'. These likelihoods
are easily defined in terms of the distribution
models (probability of observations given the underlying
state) at the heart of speech recognition, but
comparing all possible missing-data masks can quickly
become intractable. If, however, the set of
alternative data masks can be drastically cut down by
dividing time-frequency into large regions and
requiring that all cells in a given region receive the
same label, recognition again becomes
feasible. This is part of our current research.
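The reduction in the mask search space can be made concrete with a short calculation. With two labels per unit, the number of candidate masks is 2 raised to the number of independently labeled units. The sizes used below (64 frequency channels, 100 frames, 30 regions) are purely hypothetical and are not taken from the experiments.

```python
import math

def log2_num_masks(num_units, num_labels=2):
    """log2 of the number of ways to assign num_labels labels to num_units units."""
    return num_units * math.log2(num_labels)

# Hypothetical sizes, for illustration only.
cells = 64 * 100      # labeling every time-frequency cell independently
regions = 30          # labeling whole dominant-speaker regions instead

bits_per_cell_labeling = log2_num_masks(cells)
bits_per_region_labeling = log2_num_masks(regions)
print(bits_per_cell_labeling, bits_per_region_labeling)
```

Under these assumed sizes, cell-level labeling yields 2^6400 candidate masks, while region-level labeling yields 2^30 (about a billion), which a decoder can prune further during search.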
- Interpolation of Masked Regions with Examples.
Once we have clustered the segments, we can use the model
to infer the masked sections.
Here we keep the transformation maps of both layers for
the regions that the desired speaker dominates,
while relearning the transformation maps for the regions
that were masked by the other speaker. The reconstruction here is not
done freely, as in the missing-information examples shown before, since
we do
have constraints on what the data can be, given that we
can observe the mixed signal in those regions.
Moreover, restrictions on the structure that the reconstructed
signal may take have to be enforced to
prevent the reconstruction from following the structure of
the competing speaker.
The following figures show the sequence of signals from
the original composed signal to the individual
speaker signals with the estimation of their masked parts.

Original Composed Signal
Dominant region segmentation

Cluster for Speaker 1
Cluster for Speaker 2

Speaker 1 with inferred masked regions
Speaker 2 with inferred masked regions

Original Speaker 1
Original Speaker 2