Dan Ellis: Research
[Figure: Spectrograms of two people speaking the same phrase, with the changing alignments of their corresponding phones shown in between. Also available as a larger, scalable PDF file.]
Voice transformation is the process of modifying an utterance from a
particular speaker to make it sound as if it were spoken by a
specific different speaker. This transformation might involve modifications
to any aspect of the signal that carries speaker identity, for instance:
- Formant spectra i.e. the coarse spectral structure associated
with the different phones in the speech. This is the most common
target of voice transformation algorithms, which work by constructing
a map between the formant spectra of the two voices (represented, for
instance, in the cepstral domain). See for example
Kain & Macon, 2001, "Design and evaluation of a voice conversion algorithm..." (obtained from Alex Kain's web site).
- Excitation function i.e. the "fine" spectral detail held
in the residual excitation after the broad spectral shape has been
flattened. There is evidence that this provides useful information
for speaker identification (e.g. indirect evidence from automatic
speaker identification based on the whitened residual:
Yegnanarayana et al., "Source and system features for speaker recognition...").
- Prosodic features i.e. aspects of the speech that occur over
timescales larger than individual phonemes, such as fundamental
frequency (pitch), timing, and their patterns of variation.
- Mannerisms such as particular word choice or preferred phrases,
or all kinds of other very high-level behavioral characteristics that
we notice about our interlocutors.
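As a concrete illustration of the first item, mapping formant spectra between two voices can be sketched, in highly simplified form, as a single linear transform between time-aligned cepstral frames. Real systems such as Kain & Macon's use mixture-weighted local linear maps, and the frames here are synthetic stand-ins for DTW-aligned cepstra of parallel utterances:

```python
import numpy as np

# Toy sketch: learn a linear map from source-speaker cepstra to
# target-speaker cepstra. A single global matrix is a simplification;
# published conversion systems use GMM-weighted local linear maps.

rng = np.random.default_rng(0)
n_frames, n_ceps = 500, 13

# Hypothetical time-aligned cepstral frames (rows = frames). In
# practice these would come from DTW-aligned parallel recordings
# of the same phrase by the two speakers.
X = rng.normal(size=(n_frames, n_ceps))                      # source speaker
A_true = 0.3 * rng.normal(size=(n_ceps, n_ceps))
Y = X @ A_true + 0.01 * rng.normal(size=(n_frames, n_ceps))  # target speaker

# Least-squares estimate of the conversion matrix
A_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Apply the learned map to "convert" source frames toward the target
Y_pred = X @ A_hat
err = np.mean((Y_pred - Y) ** 2)
print(f"mean-squared conversion error: {err:.5f}")
```

The same least-squares machinery applies unchanged if the frames are LSPs rather than cepstra; only the parameterization differs.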
These attributes cover a very wide range of properties. For all of
them, we don't yet really understand which properties are most
important for speaker identity, or how to modify them; for the
lower-level characteristics, however, good progress has been made.
What is the interest in speaker transformation? As an end in itself, it has
some limited use for anonymity and entertainment applications, but this is
only a small part. By attempting both to analyze and then to synthesize
voices such that the message and speaker identity can be completely
separated, we will inevitably end up with a better understanding of how
both these aspects are represented in the signal. Thus, by looking at
voice transformation, we can learn things useful for:
- Speech coding: The problem of analyzing voice into a
suitably abstract domain so that it can be regenerated with a
convincing different speaker identity would give the kind of high-level
description that could be very useful for high-complexity, ultra-low
bandwidth voice communications. For instance, rather than resynthesizing
as a different speaker, resynthesize as the same speaker, but transmit
only the global speaker description plus whatever non-speaker-related
residual information is needed for the actual speech content.
- Speech synthesis: Implicit in the vision of the future
speech coder is a speech synthesis algorithm capable of reproducing
voice that sounds like any real speaker, based only on some high-level
speaker-characteristic parameters. Although this is not necessarily
the same as pure text-to-speech (since we are starting with a spoken
utterance, not a text stream), it could amount to something very
similar (i.e. the ultra-low bandwidth transmission might be something
very similar to the text of the message).
- Speech recognition: Also implicit in the coding scenario
is the idea of an analysis module able to identify and separate
speaker characteristics and content parameters. But even without
the (far-fetched?) transmission application, looking closely at
the speech signal and the nature of speech variation should help
us to improve recognition. For instance, current speech recognition
systems use almost no information other than the coarse spectral
structure (i.e. formant spectrum), and achieve speaker independence
essentially by averaging over all speakers in their training sets -
in contrast to the sense of "homing in" on a particular speaker's
voice or accent that we have as human listeners. A good understanding
and model of what attributes we are learning when we "home in"
could lead to much more detailed and accurate speech signal models
for recognizer front-ends.
Timing -- i.e. the duration of each successive speech sound -- is an
interesting example. Speech recognizer front-ends usually include some kind
of duration modeling (to discount very unlikely events, such as a vowel
that lasts 10 seconds), but the models are very weak because, again, they
are averaged across entire training sets which include all kinds of fast
and slow speech. The problem is made harder because people can vary their
rate of speech quite significantly within a single phrase. However,
there is almost certainly useful information to be obtained from
speech segment durations -- but to find out what it is, we need to look
at the variations in speech timing and gain a better understanding of
the existing patterns and available constraints.
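A first look at such timing data might be as simple as per-speaker duration statistics. The sketch below uses invented durations for a single phone; the coefficient of variation (sd/mean) is one way to factor out overall speaking rate, leaving a rate-independent spread measure comparable across fast and slow talkers:

```python
import statistics

# Toy per-speaker phone-duration statistics. The durations (seconds)
# for one phone are invented example data, not measurements.
durations = {
    "speaker_A": [0.11, 0.13, 0.10, 0.15, 0.12],
    "speaker_B": [0.18, 0.22, 0.17, 0.25, 0.20],
}

for spk, d in durations.items():
    mean = statistics.mean(d)
    sd = statistics.stdev(d)
    # cv = sd/mean: a speaking-rate-independent measure of duration spread
    print(f"{spk}: mean={mean:.3f}s  sd={sd:.3f}s  cv={sd / mean:.2f}")
```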
This project might follow the outline below:
- Duplication of spectrally-based voice transformation results,
for instance by joint modeling of time-aligned parameters for pairs
of speakers, perhaps in LSP space.
- Investigation into generalizing these kinds of transformations based
on small amounts of example data, or on non-matching data.
- Examination of phone timing data: spread of phone duration, differences
in average and spread of duration variation between speakers,
patterns of speech rate within phrases.
- Assess viability of automatic transformation between timing patterns
of different speakers.
- Similar explorations for fundamental frequency.
- Application of some of the results to improved speech recognizers.
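The frame pairing underlying the first step of the outline (aligning two speakers' renditions of the same phrase, as in the spectrogram figure at the top) is conventionally obtained by dynamic time warping. A minimal sketch, with invented 1-D feature tracks standing in for cepstral or LSP frames:

```python
import numpy as np

# Minimal dynamic-time-warping sketch for pairing frames of two
# renditions of the same phrase, prior to joint spectral modeling.

def dtw_path(x, y):
    """Return the minimum-cost alignment path between sequences x and y."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])            # local frame distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # x advances alone
                                 cost[i, j - 1])      # y advances alone
    # Backtrack from the corner to recover the frame pairing
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

slow = [0, 0, 1, 1, 2, 2, 3, 3]   # same contour at half speed
fast = [0, 1, 2, 3]
print(dtw_path(slow, fast))
```

Each index pair in the returned path is one (source frame, target frame) correspondence; collecting the paired feature vectors over many phrases gives the training data for the conversion map.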
Last updated: $Date: 2001/05/29 00:05:04 $
Dan Ellis <firstname.lastname@example.org>