TY - GEN
T1 - Investigation of Perception Inconsistency in Speaker Embedding for Asynchronous Voice Anonymization
AU - Wang, Rui
AU - Chen, Liping
AU - Lee, Kong Aik
AU - Zha, Zhengpeng
AU - Ling, Zhenhua
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - In speech generation frameworks that represent the speaker attribute with an embedding vector, asynchronous voice anonymization can be achieved by modifying the speaker embedding derived from the original speech. However, the inconsistency between machine and human perception of the speaker attribute within the speaker embedding remains unexplored, which limits the performance of asynchronous voice anonymization. To this end, this study investigates the inconsistency by modifying the speaker embedding in the speech generation process. Experiments on the FACodec and Diff-HierVC speech generation models reveal a subspace whose removal alters machine perception of the speaker attribute in the generated speech while preserving human perception. Building on these findings, an asynchronous voice anonymization method is developed that achieves a 100% human perception preservation rate while obscuring machine perception. Audio samples can be found at https://voiceprivacy.github.io/speaker-embedding-eigen-decomposition/.
UR - https://www.scopus.com/pages/publications/105030438606
U2 - 10.1109/APSIPAASC65261.2025.11249401
DO - 10.1109/APSIPAASC65261.2025.11249401
M3 - Conference article published in proceeding or book
AN - SCOPUS:105030438606
T3 - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
SP - 2074
EP - 2079
BT - 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2025
Y2 - 22 October 2025 through 24 October 2025
ER -