In many audio signals, including speech
and musical instrument sounds, there is high correlation
between adjacent frames of their spectral
representation. Our approach exploits
this correlation so that explicit models
are required only for those frames that cannot be accurately
predicted from their context.
Our model captures the general properties
of such audio sources by modeling the evolution
of their harmonic components. Using
the common source-filter model for such signals, we
devise a layered generative model that
describes the source and filter components in separate layers:
one for the excitation harmonics, and
another for resonances such as vocal tract formants.
Our approach explicitly models the self-similarity
and dynamics of each layer by fitting the
log-spectra in frame t with a set of
transformations of the log-spectra in frame t-1. As a result,
we do not require separate states for
every possible spectral configuration, but only a limited
set of "sharp" states that can still
cover the full spectral variety of a source via such
transformations. This approach is thus
suitable for any time series data with high correlation
between adjacent observations.
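
The idea of fitting frame t with transformations of frame t-1 can be illustrated with a minimal sketch. It assumes the transformations are restricted to small integer shifts along the frequency axis and that each bin picks its shift independently by least squares over a local patch; the function names, the patch size and the shift range are hypothetical choices made for illustration, and the sketch is not the inference procedure of the model described here.

```python
import numpy as np

def estimate_shift_field(prev, curr, patch=5, max_shift=2):
    """For each frequency bin k, pick the integer shift s such that the patch
    of `prev` centered at k - s best matches the patch of `curr` centered at k.
    Assumption: shifts are the only transformations considered."""
    if patch % 2 == 0:
        raise ValueError("patch must be odd")
    n = len(curr)
    half = patch // 2
    pad = half + max_shift
    shifts = np.arange(-max_shift, max_shift + 1)
    prev_p = np.pad(prev, pad, mode="edge")   # prev[i] lives at prev_p[i + pad]
    curr_p = np.pad(curr, half, mode="edge")  # curr[i] lives at curr_p[i + half]
    field = np.zeros(n, dtype=int)
    for k in range(n):
        target = curr_p[k:k + patch]          # patch centered at bin k
        errs = []
        for s in shifts:
            start = k - s - half + pad        # patch centered at bin k - s
            cand = prev_p[start:start + patch]
            errs.append(np.sum((target - cand) ** 2))
        field[k] = shifts[int(np.argmin(errs))]
    return field

def warp(prev, field):
    """Predict the current frame by reading `prev` at bin k - field[k]."""
    n = len(prev)
    src = np.clip(np.arange(n) - field, 0, n - 1)
    return prev[src]
```

A positive entry in the returned field means the local spectral structure moved up in frequency between the two frames. Frames where the warped prediction still leaves a large residual are the ones that would need an explicit model.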
We first introduce a model that
captures the spectral deformation field of the speech
harmonics, and show how it can be
exploited to interpolate missing observations. We then
introduce the two-layer model, which models
the deformation fields of the harmonic
and formant resonance components separately.
Comparing one-layer and two-layer results
in the missing-data scenario, we show that
this separation is necessary to describe
speech signals accurately.
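
To make the missing-data use of a deformation concrete, here is a small sketch under strong simplifying assumptions: a single global integer frequency shift per frame (rather than a full deformation field), estimated only from the observed bins and then used to fill the missing bins from the previous frame. The function name `fill_missing` and its parameters are hypothetical.

```python
import numpy as np

def fill_missing(prev, curr, observed, max_shift=3):
    """Fill unobserved bins of `curr` from a shifted copy of `prev`.
    `observed` is a boolean mask; values of `curr` where it is False are ignored.
    Assumption: one global integer shift explains the whole frame."""
    observed = np.asarray(observed, dtype=bool)
    if not observed.any():
        raise ValueError("at least one bin must be observed")
    n = len(curr)
    bins = np.arange(n)
    best_err, best_pred = np.inf, None
    for s in range(-max_shift, max_shift + 1):
        # Candidate prediction: bin k of the current frame comes from bin k - s.
        pred = prev[np.clip(bins - s, 0, n - 1)]
        err = np.mean((pred[observed] - curr[observed]) ** 2)
        if err < best_err:
            best_err, best_pred = err, pred
    # Keep observed values, use the shifted previous frame elsewhere.
    return np.where(observed, curr, best_pred)
```

A single shift of this kind moves every component of the spectrum together, so it cannot let harmonics and formants move independently; that limitation is what motivates modeling the two with separate layers.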
We then present the complete model,
which includes the two deformation fields and the
"sharp" states. With only
a few states and both deformation fields, this model can
accurately reconstruct the signal.
Finally, we briefly describe a range
of existing applications including semi-supervised source
separation, and discuss the model's
possible application to unsupervised source separation.