TY - GEN
T1 - Sparse representation of phonetic features for voice conversion with and without parallel data
AU - Sisman, Berrak
AU - Li, Haizhou
AU - Tan, Kay Chen
N1 - Funding Information:
This research is supported by Ministry of Education, Singapore AcRF Tier 1 NUS Start-up Grant FY2016. Berrak Sisman is also funded by SINGA Scholarship under A*STAR Graduate Academy.
Publisher Copyright:
© 2017 IEEE.
PY - 2018/1/24
Y1 - 2018/1/24
AB - This paper presents a voice conversion framework that uses phonetic information in an exemplar-based voice conversion approach. The proposed idea is motivated by the fact that phone-dependent exemplars lead to better estimation of the activation matrix and, therefore, possibly better conversion. We propose to use the phone segmentation results from automatic speech recognition (ASR) to construct a sub-dictionary for each phone. The proposed framework can work with or without parallel training data. With parallel training data, we found that the phonetic sub-dictionary outperforms the state-of-the-art baseline in both objective and subjective evaluations. Without parallel training data, we use Phonetic PosteriorGrams (PPGs) as the speaker-independent exemplars in the phonetic sub-dictionary to serve as a bridge between speakers. We report that such a technique achieves competitive performance without the need for parallel training data.
KW - phonetic exemplars
KW - Phonetic PosteriorGrams
KW - sparse representation
KW - Voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85046631146&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2017.8269002
DO - 10.1109/ASRU.2017.8269002
M3 - Conference article published in proceeding or book
AN - SCOPUS:85046631146
T3 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
SP - 677
EP - 684
BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Y2 - 16 December 2017 through 20 December 2017
ER -