Pseudo-speaker Distribution Learning in Voice Anonymization

Liping Chen, Wenju Gu, Kong Aik Lee, Wu Guo, Zhen Hua Ling

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

The state-of-the-art framework for voice anonymization involves disentangling the speaker attributes from the original speech, substituting their representation with that of a pseudo-speaker, and generating the anonymized utterance. The generation of the pseudo-speaker representation typically involves two steps: selecting a cohort of speakers from a pre-defined speaker pool and then constructing the pseudo-speaker representation based on the selected cohort speakers. This paper focuses on pseudo-speaker modeling given the cohort speakers. In this regard, we propose to model the speaker attributes with a distribution instead of the conventional point estimate. Neural networks are utilized to learn pseudo-speaker distributions, leveraging the nonlinear speaker variations among the cohort speakers. Specifically, both utterances and frames are investigated as modeling units. Experiments were conducted on the LibriSpeech and VCTK datasets. The results demonstrated the efficacy of the proposed pseudo-speaker modeling method in improving voice protection performance, especially in increasing the voice distinctiveness among the anonymized utterances. Moreover, our results also show that the pseudo-speaker model performs better when learnt from frame-level representations than from their utterance-level counterparts.
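
The contrast between a point estimate and a distribution over the cohort can be illustrated with a minimal sketch. It is not the authors' method: the paper learns the distribution with neural networks, whereas this sketch assumes a diagonal Gaussian as a simple stand-in. The function names, the embedding dimension, and the toy cohort are all hypothetical.

```python
import random
from statistics import fmean, pstdev


def point_estimate(cohort):
    """Conventional pseudo-speaker: per-dimension average of the cohort embeddings."""
    return [fmean(dim) for dim in zip(*cohort)]


def fit_diagonal_gaussian(cohort):
    """Assumed stand-in for a learned distribution: per-dimension mean and std."""
    means = [fmean(dim) for dim in zip(*cohort)]
    stds = [pstdev(dim) for dim in zip(*cohort)]
    return means, stds


def sample_pseudo_speaker(means, stds, rng):
    """Draw one pseudo-speaker embedding from the fitted diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]


rng = random.Random(0)

# Hypothetical cohort: 5 speakers, each with a 4-dimensional embedding.
cohort = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]

avg = point_estimate(cohort)               # same vector every time for this cohort
means, stds = fit_diagonal_gaussian(cohort)
a = sample_pseudo_speaker(means, stds, rng)  # varies from draw to draw
b = sample_pseudo_speaker(means, stds, rng)
```

With a point estimate, every utterance anonymized from the same cohort receives the same pseudo-speaker vector. Sampling from a distribution yields a different vector on each draw, which is one intuition for the improved distinctiveness among anonymized utterances that the abstract reports.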

Original language: English
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Publication status: Published - Dec 2024

Keywords

  • frame-level speaker distribution
  • pseudo-speaker distribution
  • utterance-level speaker distribution representation
  • Voice anonymization

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
