TY - GEN
T1 - Transformation of prosody in voice conversion
AU - Sisman, Berrak
AU - Li, Haizhou
AU - Tan, Kay Chen
N1 - Funding Information:
This research is supported by Ministry of Education, Singapore AcRF Tier 1 NUS Start-up Grant FY2016. Berrak Sisman is funded by SINGA Scholarship under A*STAR Graduate Academy.
Publisher Copyright:
© 2017 IEEE.
PY - 2018/2/5
Y1 - 2018/2/5
N2 - Voice Conversion (VC) aims to convert one's voice to sound like that of another. So far, most of the voice conversion frameworks mainly focus only on the conversion of spectrum. We note that speaker identity is also characterized by the prosody features such as fundamental frequency (F0), energy contour and duration. Motivated by this, we propose a framework that can perform F0, energy contour and duration conversion. In the traditional exemplar-based sparse representation approach to voice conversion, a general source-target dictionary of exemplars is constructed to establish the correspondence between source and target speakers. In this work, we propose a Phonetically Aware Sparse Representation of fundamental frequency and energy contour by using Continuous Wavelet Transform (CWT). Our idea is motivated by the facts that CWT decompositions of F0 and energy contours describe prosody patterns in different temporal scales and allow for effective prosody manipulation in speech synthesis. Furthermore, phonetically aware exemplars lead to better estimation of activation matrix, therefore, possibly better conversion of prosody. We also propose a phonetically aware duration conversion framework which takes into account both phone-level and sentence-level speaking rates. We report that the proposed prosody conversion outperforms the traditional prosody conversion techniques in both objective and subjective evaluations.
AB - Voice Conversion (VC) aims to convert one's voice to sound like that of another. So far, most of the voice conversion frameworks mainly focus only on the conversion of spectrum. We note that speaker identity is also characterized by the prosody features such as fundamental frequency (F0), energy contour and duration. Motivated by this, we propose a framework that can perform F0, energy contour and duration conversion. In the traditional exemplar-based sparse representation approach to voice conversion, a general source-target dictionary of exemplars is constructed to establish the correspondence between source and target speakers. In this work, we propose a Phonetically Aware Sparse Representation of fundamental frequency and energy contour by using Continuous Wavelet Transform (CWT). Our idea is motivated by the facts that CWT decompositions of F0 and energy contours describe prosody patterns in different temporal scales and allow for effective prosody manipulation in speech synthesis. Furthermore, phonetically aware exemplars lead to better estimation of activation matrix, therefore, possibly better conversion of prosody. We also propose a phonetically aware duration conversion framework which takes into account both phone-level and sentence-level speaking rates. We report that the proposed prosody conversion outperforms the traditional prosody conversion techniques in both objective and subjective evaluations.
UR - http://www.scopus.com/inward/record.url?scp=85046632090&partnerID=8YFLogxK
U2 - 10.1109/APSIPA.2017.8282288
DO - 10.1109/APSIPA.2017.8282288
M3 - Conference article published in proceeding or book
AN - SCOPUS:85046632090
T3 - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
SP - 1537
EP - 1546
BT - Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017
Y2 - 12 December 2017 through 15 December 2017
ER -