Asymmetric Clean Segments-Guided Self-Supervised Learning for Robust Speaker Verification

Chong-Xin Gan, Man-Wai Mak, Weiwei Lin, Jen-Tzung Chien

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

Abstract

Contrastive self-supervised learning (CSL) for speaker verification (SV) has drawn increasing interest recently due to its ability to exploit unlabeled data. Data augmentation on raw waveforms, such as adding noise or reverberation, plays a pivotal role in achieving promising results in SV. Augmentation, however, demands meticulous calibration to ensure that speaker-specific information remains intact, which is difficult to achieve without speaker labels. To address this issue, we introduce a novel framework that incorporates both clean and augmented segments into the contrastive training pipeline. The clean segments are repurposed to pair with noisy segments to form additional positive and negative pairs. Moreover, the contrastive loss is weighted to increase the difference between the clean and augmented embeddings of different speakers. Experimental results on VoxCeleb1 suggest that the proposed framework achieves a 19% improvement over conventional methods and surpasses many existing state-of-the-art techniques.
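
The abstract does not give the loss in closed form, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that row i of `clean` and row i of `aug` are embeddings of a clean and an augmented segment from the same utterance (the positive pair), that segments from different utterances are treated as different speakers (the negative pairs), and that the weighting is a single scalar applied to the negative terms in an InfoNCE-style denominator. The names `weighted_clean_aug_loss`, `temperature`, and `neg_weight` are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_clean_aug_loss(clean, aug, temperature=0.1, neg_weight=2.0):
    """InfoNCE-style loss with clean anchors and augmented candidates.

    Assumption (not from the paper): negatives are up-weighted by one
    scalar, neg_weight > 0, inside the softmax denominator.
    """
    clean = [l2_normalize(v) for v in clean]
    aug = [l2_normalize(v) for v in aug]
    losses = []
    for i, anchor in enumerate(clean):
        # Scaled cosine similarity between clean anchor i and every augmented segment.
        logits = [dot(anchor, other) / temperature for other in aug]
        # Weight 1 for the positive (j == i), neg_weight for negatives, in the log domain.
        shifted = [
            s + (0.0 if j == i else math.log(neg_weight))
            for j, s in enumerate(logits)
        ]
        # Numerically stable log-sum-exp of the weighted terms.
        m = max(shifted)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in shifted))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

With `neg_weight=1.0` this reduces to an ordinary InfoNCE loss over clean/augmented pairs. Values above 1 penalize similarity between a clean embedding and the augmented embeddings of other utterances more heavily, which is one plausible reading of "increasing the difference" between embeddings of different speakers.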

Original language: English
Title of host publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings
Publisher: IEEE
Pages: 11081-11085
Number of pages: 5
ISBN (Electronic): 9798350344851
ISBN (Print): 1520-6149
Publication status: Published - Apr 2024

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Keywords

  • Speaker verification
  • contrastive learning
  • hard negative pairs
  • self-supervised learning
  • weighted contrastive loss

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
