Self-Supervised Training of Speaker Encoder With Multi-Modal Diverse Positive Pairs

Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamaki, Haizhou Li

Research output: Journal article › Academic research › peer-review

3 Citations (Scopus)

Abstract

We study a novel neural speaker encoder and its training strategies for speaker recognition without using any identity labels. The speaker encoder is trained to extract a fixed-dimensional speaker embedding from a spoken utterance of variable length. Contrastive learning is a typical self-supervised learning technique; however, contrastive learning of the speaker encoder depends heavily on the sampling strategy for positive and negative pairs. It is common to sample a positive pair of segments from the same utterance. Unfortunately, such a strategy, denoted as poor-man's positive pairs (PPP), lacks the necessary diversity. In this work, we propose a multi-modal contrastive learning technique with novel sampling strategies. By cross-referencing between speech and face data, we find diverse positive pairs (DPP) for contrastive learning, thus improving the robustness of the speaker encoder. We train the speaker encoder on the VoxCeleb2 dataset without any speaker labels, and achieve an equal error rate (EER) of 2.89%, 3.17% and 6.27% under the proposed progressive clustering strategy, and an EER of 1.44%, 1.77% and 3.27% under the two-stage learning strategy with pseudo labels, on the three test sets of VoxCeleb1. This novel solution outperforms state-of-the-art self-supervised learning methods by a large margin, while achieving results comparable to its supervised learning counterpart. We also evaluate our self-supervised learning technique on the LRS2 and LRW datasets, where speaker information is unavailable. All experiments suggest that the proposed neural architecture and sampling strategies are robust across datasets.
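
The contrast the abstract draws between poor-man's positive pairs (two segments of the same utterance) and diverse positive pairs can be illustrated with a minimal sketch of contrastive training. The code below is an illustrative assumption, not the authors' implementation: the segment length, temperature, NT-Xent-style loss, and the placeholder mean-pooling "encoder" are all hypothetical stand-ins for the paper's neural speaker encoder and sampling pipeline.

```python
# Minimal sketch of contrastive learning with "poor-man's positive pairs" (PPP):
# two random segments cut from the same utterance form the positive pair.
# All hyperparameters and the toy encoder are placeholders for illustration.
import torch
import torch.nn.functional as F


def sample_ppp(waveform: torch.Tensor, seg_len: int):
    """Cut two random segments from one utterance (the PPP strategy)."""
    max_start = waveform.shape[-1] - seg_len
    s1, s2 = torch.randint(0, max_start, (2,))
    return waveform[..., s1:s1 + seg_len], waveform[..., s2:s2 + seg_len]


def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, tau: float = 0.07):
    """NT-Xent-style loss: embedding i in view A should match embedding i in view B."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / tau           # (B, B) cosine similarities
    targets = torch.arange(emb_a.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


# Toy usage with a stand-in "encoder" (reshape + mean pooling); a real system
# would use a neural speaker encoder producing fixed-dimensional embeddings.
batch = torch.randn(8, 32000)                               # 8 utterances, ~2 s at 16 kHz
seg_a, seg_b = (torch.stack(v) for v in zip(*(sample_ppp(u, 16000) for u in batch)))
encoder = lambda x: x.reshape(x.size(0), 100, -1).mean(-1)  # placeholder encoder
loss = contrastive_loss(encoder(seg_a), encoder(seg_b))
```

Under the paper's DPP idea, the positive for an anchor segment would instead be drawn from a different utterance judged to belong to the same speaker (via cross-referencing speech and face data), rather than from the same recording; only the pair-selection step above would change, not the contrastive objective itself.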

Original language: English
Article number: 10106039
Pages (from-to): 1706-1719
Number of pages: 14
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 31
DOIs
Publication status: Published - Apr 2023
Externally published: Yes

Keywords

  • diverse positive pairs
  • multi-modal
  • progressive clustering
  • self-supervised learning
  • speaker recognition

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
