INTERSPEECH conferences emphasize interdisciplinary approaches addressing all aspects of speech science and technology, ranging from basic theories to advanced applications.
Yinghao and Ali focus on voice conversion models that require no training labels, so they can take any new recording and convert it directly to another person's voice without the need for training data.
Voice conversion is the task of changing the voice of one speaker (the source) to that of another speaker (the target) while keeping the content of the speech constant. This problem typically requires training on massive amounts of data: many voice samples from both the source and target speakers, with varied content, so that the model can learn to separate the features related to the content of the speech from the features related to the speakers' voices. The most important achievement of this work is the quality of the converted voice, which is very natural and hard to distinguish from real recordings. Our model can also perform voice conversion regardless of the language and accent of the source or target speakers; it can even make the target speaker sing! Such a model can help second-language learners or those who aspire to learn singing. The model can also convert a plain reading voice into emotional speech that uses falsetto, which enables applications such as movie dubbing.