Contrastive Self-Supervised Speaker Embedding With Sequential Disentanglement

Youzhi Tu, Man Wai Mak, Jen Tzung Chien

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

Contrastive self-supervised learning has been widely used in speaker embedding to address the labeling challenge. Contrastive speaker embedding assumes that the contrast between the positive and negative pairs of speech segments is attributed to speaker identity only. However, this assumption is incorrect because speech signals contain not only speaker identity but also linguistic content. In this paper, we propose a contrastive learning framework with sequential disentanglement to remove linguistic content by incorporating a disentangled sequential variational autoencoder (DSVAE) into the conventional contrastive learning framework. The DSVAE aims to disentangle speaker factors from content factors in an embedding space so that the speaker factors become the main contributor to the contrastive loss. Because content factors have been removed from contrastive learning, the resulting speaker embeddings will be content-invariant. The learned embeddings are also robust to language mismatch. It is shown that the proposed method consistently outperforms the conventional contrastive speaker embedding on the VoxCeleb1 and CN-Celeb datasets. This finding suggests that applying sequential disentanglement is beneficial to learning speaker-discriminative embeddings.
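
The central idea above, that the contrastive loss should see only speaker factors, can be illustrated with a small sketch. This is not the authors' implementation and does not include the DSVAE: it assumes some encoder has already split each speech segment into a time-invariant speaker vector and separate content vectors, and it applies a generic InfoNCE-style contrastive loss to the speaker vectors alone. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumption; the paper's exact loss may differ).

    positives[i] is the positive for anchors[i]; every other row of
    `positives` serves as a negative for that anchor.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Hypothetical speaker factors for two segments per utterance.
# Content factors are assumed to have been separated out upstream,
# so they never enter the loss.
speaker_view_a = [[1.0, 0.0], [0.0, 1.0]]
speaker_view_b = [[0.9, 0.1], [0.1, 0.9]]

loss_matched = info_nce(speaker_view_a, speaker_view_b)
loss_swapped = info_nce(speaker_view_a, speaker_view_b[::-1])
```

With these toy vectors, the loss is lower when each anchor is paired with the segment that shares its speaker factor than when the pairs are swapped. That is the behaviour a contrastive objective rewards.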

Original language: English
Article number: 10531230
Pages (from-to): 2704-2715
Number of pages: 12
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 32
Publication status: Published - 16 May 2024

Keywords

  • Acoustics
  • contrastive learning
  • disentangled representation learning
  • Electronic mail
  • Faces
  • Labeling
  • Linguistics
  • speaker embedding
  • Speaker verification
  • Speech processing
  • Training
  • variational autoencoder

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
