Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification

Research output: Conference article published in proceeding or book (Academic research, peer-reviewed)

5 Citations (Scopus)

Abstract

Training speaker verification (SV) systems without labeled data is challenging. To tackle this challenge, we propose Multi-Head, Multi-Mode (MeMo) self-supervised learning based on knowledge distillation. Unlike DINO, the teacher in MeMo uses two distinct architectures that learn collaboratively, and so does the student. MeMo employs two distillation modes: self-distillation and cross-distillation, in which the teacher and student have the same and different architectures, respectively. To reduce the output discrepancy caused by the differing architectures, we divide the projection head into self- and cross-heads, so that each head is responsible for distillation in its respective mode. We also find that contrastive learning at the embedding level is helpful only in the early stages of training. To address this issue, we propose dynamically stopping the contrastive learning while continuing knowledge distillation. MeMo achieves an impressive EER of 3.10% on VoxCeleb1 using a small ECAPA-TDNN backbone.
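The abstract's two distillation modes can be illustrated with a minimal sketch. This is not the authors' implementation: the linear heads, embedding shapes, DINO-style temperatures, and the pairing of architectures are all illustrative assumptions; random vectors stand in for the backbone embeddings of the two architectures on the teacher and student sides.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, temp):
    """Temperature-scaled softmax (numerically stabilized)."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class Head:
    """Hypothetical minimal projection head (a single linear layer stands in
    for the MLP projection head used in DINO-style methods)."""
    def __init__(self, dim_in, dim_out):
        self.W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
    def __call__(self, x):
        return x @ self.W

def distill_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """DINO-style loss: cross-entropy between the sharpened teacher
    distribution and the student distribution (temperatures assumed)."""
    p_t = softmax(teacher_logits, t_t)
    log_p_s = np.log(softmax(student_logits, t_s) + 1e-12)
    return -(p_t * log_p_s).sum(axis=-1).mean()

# Stand-in embeddings from two backbone architectures ("A", "B")
# on both the teacher and student sides (batch of 4, 192-dim).
emb = {(side, arch): rng.standard_normal((4, 192))
       for side in ("student", "teacher") for arch in ("A", "B")}

# Separate projection heads per mode, as described in the abstract:
# the self-head handles same-architecture pairs, the cross-head
# absorbs the output discrepancy between different architectures.
self_head  = {side: Head(192, 64) for side in ("student", "teacher")}
cross_head = {side: Head(192, 64) for side in ("student", "teacher")}

# Self-distillation: teacher and student share the same architecture.
loss_self = sum(
    distill_loss(self_head["student"](emb[("student", a)]),
                 self_head["teacher"](emb[("teacher", a)]))
    for a in ("A", "B"))

# Cross-distillation: teacher and student use different architectures.
loss_cross = sum(
    distill_loss(cross_head["student"](emb[("student", a)]),
                 cross_head["teacher"](emb[("teacher", b)]))
    for a, b in (("A", "B"), ("B", "A")))

total = loss_self + loss_cross
print(f"self: {loss_self:.3f}  cross: {loss_cross:.3f}  total: {total:.3f}")
```

In a real training loop the teacher would typically be an exponential moving average of the student and receive no gradients, and only the student heads would be updated; those details are omitted here for brevity.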

Original language: English
Title of host publication: Interspeech 2024
Pages: 4723-4727
Number of pages: 5
DOIs
Publication status: Published - Sept 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sept 2024 – 5 Sept 2024

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
ISSN (Print): 2308-457X

Conference

Conference: 25th Interspeech Conference 2024
Country/Territory: Greece
City: Kos Island
Period: 1/09/24 – 5/09/24

Keywords

  • cross-distillation
  • DINO
  • knowledge distillation
  • self-supervised learning
  • speaker verification

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
