Spectrograms of two people speaking the same phrase, with the changing alignments of their corresponding phones shown in-between. This image is also available as a larger, scalable PDF file.

Voice transformation is the process of modifying an utterance from one speaker so that it sounds as if it were spoken by a specific, different speaker. This transformation might involve modifications to any aspect of the signal that carries speaker identity, for instance:

These attributes cover a very wide range of properties. It is probably fair to say that, for all of them, we do not yet fully understand which properties are most important for speaker identity, or how to modify them. However, for the lower-level characteristics, good progress has been made.

What is the interest in speaker transformation? As an end in itself, it has some limited use for anonymity and entertainment applications, but this is only a small part of the motivation. By attempting both to analyze and then to synthesize voices such that the message and speaker identity can be completely separated, we will inevitably end up with a better understanding of how both these aspects are represented in the signal. Thus, by looking at voice transformation, we can learn things useful for:

Timing -- i.e. the duration of each successive speech sound -- is an interesting example. Speech recognizer front-ends usually include some kind of duration modeling (to discount very unlikely events, such as a vowel that lasts 10 seconds), but the models are very weak because, again, they are averaged across entire training sets which include all kinds of fast and slow speech. The problem is made harder because people can vary their rate of speech quite significantly within a single phrase. Even so, speech segment durations very likely carry useful information -- but to find out what it is, we need to look at the variations in speech timing and gain a better understanding of the existing patterns and available constraints.
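To make the idea of duration modeling concrete, here is a minimal sketch of how a recognizer might discount implausible segment durations. It is an illustration only, not the method of any particular system. The log-normal form, the function names, and the example vowel durations are all assumptions invented for this example.

```python
import math

def fit_log_duration_model(durations_sec):
    """Fit a log-normal model: return (mean, std) of the log-durations.

    Hypothetical helper; the log-normal choice is an assumption.
    """
    logs = [math.log(d) for d in durations_sec]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)

def log_duration_score(duration_sec, mean, std):
    """Log-density of one duration under the fitted log-normal model."""
    z = (math.log(duration_sec) - mean) / std
    return -0.5 * z * z - math.log(duration_sec * std * math.sqrt(2 * math.pi))

# Invented durations (seconds) for one vowel class, pooled over a
# training set that mixes fast and slow speech.
vowel_durations = [0.06, 0.08, 0.10, 0.12, 0.15, 0.09, 0.11]

mean, std = fit_log_duration_model(vowel_durations)
typical = log_duration_score(0.10, mean, std)       # an ordinary vowel
implausible = log_duration_score(10.0, mean, std)   # a 10-second vowel
```

A model like this scores the 10-second vowel far lower than the 100 ms one, which is about all it can do. Because it is fit to durations pooled over every speaking rate, it says nothing about how rate changes within a phrase, which is the weakness described above.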

This project might follow the outline below:

