A Framework for Adapting DNN Speaker Embedding across Languages

Weiwei Lin, Man Wai Mak, Na Li, Dan Su, Dong Yu

Research output: Journal article (academic research, peer-reviewed)

14 Citations (Scopus)

Abstract

Language mismatch remains a major hindrance to the wide deployment of speaker verification (SV) systems. Current language adaptation methods in SV mainly rely on linear projection in the embedding space; i.e., adaptation is carried out after the speaker embeddings have been created, which underutilizes the representational power of deep neural networks. This article proposes a maximum mean discrepancy (MMD) based framework for adapting deep neural network (DNN) speaker embeddings across languages, featuring a multi-level domain loss, separate batch normalization, and consistency regularization. We refer to the framework as MSC. We show that (1) minimizing domain discrepancy at both the frame and utterance levels performs significantly better than at the utterance level alone; (2) separating the source-domain data from the target-domain data in batch normalization improves adaptation performance; and (3) data augmentation can be utilized in the unlabelled target domain through consistency regularization. By combining these findings, we achieve EERs of 8.69% and 7.95% on NIST SRE 2016 and 2018, respectively, which are significantly better than those of previously proposed DNN adaptation methods. Our framework also works well with backend adaptation: combining the proposed framework with backend adaptation yields an 11.8% improvement over backend adaptation alone on SRE18. When applying our framework to a 121-layer DenseNet, we achieve EERs of 7.81% and 7.02% on NIST SRE 2016 and 2018, respectively.
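The domain losses above are built on the maximum mean discrepancy, which measures the distance between the source- and target-domain feature distributions via kernel mean embeddings. The following is a minimal sketch of the (biased) squared-MMD estimate with an RBF kernel; the function names and the single-bandwidth kernel are illustrative assumptions, not the exact multi-level loss used in the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
    # computed from pairwise squared Euclidean distances.
    d2 = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def mmd2(x, y, gamma=1.0):
    # Biased estimate of squared MMD between samples x (source domain)
    # and y (target domain):
    #   MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    kxx = rbf_kernel(x, x, gamma).mean()
    kyy = rbf_kernel(y, y, gamma).mean()
    kxy = rbf_kernel(x, y, gamma).mean()
    return kxx + kyy - 2.0 * kxy
```

In an adaptation setting of this kind, such a term would be evaluated on mini-batches of frame- or utterance-level activations from the two domains and added to the speaker-classification loss, so that minimizing it pulls the two feature distributions together.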

Original language: English
Article number: 9224137
Pages (from-to): 2810-2822
Number of pages: 13
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 28
DOIs
Publication status: Published - Oct 2020

Keywords

  • data augmentation
  • domain adaptation
  • maximum mean discrepancy
  • speaker verification (SV)
  • transfer learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

