Sparse representation of phonetic features for voice conversion with and without parallel data

Berrak Cicman, Haizhou Li, Kay Chen Tan

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

37 Citations (Scopus)

Abstract

This paper presents a voice conversion framework that uses phonetic information in an exemplar-based voice conversion approach. The idea is motivated by the observation that phone-dependent exemplars lead to better estimation of the activation matrix and therefore, possibly, to better conversion. We propose using the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework works with or without parallel training data. With parallel training data, we found that the phonetic sub-dictionary outperforms the state-of-the-art baseline in objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as speaker-independent exemplars in the phonetic sub-dictionary to serve as a bridge between speakers. We report that this technique achieves competitive performance without the need for parallel training data.
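
The core idea in the abstract, estimating a sparse activation matrix against a phone-specific sub-dictionary and reusing it with the paired target exemplars, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the objective or the update rule, so the KL-divergence multiplicative update with an L1 sparsity penalty used here is an assumption (a common choice in exemplar-based voice conversion). The function names `estimate_activations` and `convert`, the dictionary layout (features × exemplars, keyed by phone label), and all parameter values are hypothetical.

```python
import numpy as np

def estimate_activations(X, A, sparsity=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that X ≈ A @ H.

    Assumed objective: KL divergence plus an L1 penalty on H,
    minimised with multiplicative updates.
    X: (n_features, n_frames), A: (n_features, n_exemplars).
    """
    rng = np.random.default_rng(0)
    H = rng.random((A.shape[1], X.shape[1])) + eps
    # Denominator of the update: column sums of A plus the sparsity weight.
    denom = A.sum(axis=0)[:, None] + sparsity
    for _ in range(n_iter):
        H *= (A.T @ (X / (A @ H + eps))) / denom
    return H

def convert(X, phones, src_dicts, tgt_dicts, **kw):
    """Convert frames phone by phone using paired sub-dictionaries.

    phones: one phone label per frame (e.g. from an ASR segmentation).
    src_dicts / tgt_dicts: phone label -> exemplar matrix; the columns of
    the two matrices for a given phone are assumed to be paired.
    """
    n_out = next(iter(tgt_dicts.values())).shape[0]
    Y = np.zeros((n_out, X.shape[1]))
    for p in np.unique(phones):
        idx = np.flatnonzero(phones == p)
        # Activations are estimated against the source sub-dictionary only ...
        H = estimate_activations(X[:, idx], src_dicts[p], **kw)
        # ... and applied to the paired target sub-dictionary.
        Y[:, idx] = tgt_dicts[p] @ H
    return Y
```

Restricting each frame to the exemplars of its own phone keeps the activation estimate from mixing exemplars of unrelated phones, which is the motivation the abstract gives. In the setting without parallel data, the abstract indicates that PPGs serve as the speaker-independent exemplars; extracting PPGs is outside the scope of this sketch.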

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 677-684
Number of pages: 8
ISBN (Electronic): 9781509047888
Publication status: Published - 24 Jan 2018
Externally published: Yes
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 16 Dec 2017 to 20 Dec 2017

Publication series

Name: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Volume: 2018-January

Conference

Conference: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country/Territory: Japan
City: Okinawa
Period: 16/12/17 to 20/12/17

Keywords

  • phonetic exemplars
  • Phonetic PosteriorGrams
  • sparse representation
  • Voice conversion

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
