TY - GEN
T1 - Self-Supervised Learning with Multi-Head Multi-Mode Knowledge Distillation for Speaker Verification
AU - Jin, Zezhong
AU - Tu, Youzhi
AU - Mak, Man Wai
N1 - Publisher Copyright:
© 2024 International Speech Communication Association. All rights reserved.
PY - 2024/9
Y1 - 2024/9
AB - Training speaker verification (SV) systems without labeled data is challenging. To tackle this challenge, we propose Multi-Head, Multi-Mode (MeMo) self-supervised learning based on knowledge distillation. Unlike DINO, the teacher in MeMo uses two distinct architectures that learn collaboratively, and so does the student. MeMo employs two distillation modes, self- and cross-distillation, in which the teacher and student have the same and different architectures, respectively. To reduce the output discrepancy caused by the different architectures, we split the projection head into self- and cross-heads so that each head is responsible for distillation in its respective mode. We also find that contrastive learning at the embedding level is beneficial only in the early training stages. To address this issue, we propose dynamically stopping the contrastive learning while continuing knowledge distillation. MeMo achieves an impressive EER of 3.10% on VoxCeleb1 using a small ECAPA-TDNN backbone.
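N1 - The abstract describes the MeMo objective: self-distillation between same-architecture teacher/student pairs, cross-distillation between different-architecture pairs through dedicated cross-heads, plus an embedding-level contrastive term that is dynamically stopped. Below is a minimal PyTorch sketch of that loss structure, not the authors' released code: the names (dino_loss, Heads, memo_loss), dimensions, temperatures, and the InfoNCE form of the contrastive term are assumptions, and teacher centering and EMA updates are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, t_s=0.1, t_t=0.04):
    """DINO-style distillation: cross-entropy between the teacher's
    sharpened softmax (no gradient) and the student's log-softmax.
    Teacher output centering is omitted here for brevity."""
    t = F.softmax(teacher_out / t_t, dim=-1).detach()
    return -(t * F.log_softmax(student_out / t_s, dim=-1)).sum(-1).mean()

class Heads(torch.nn.Module):
    """Split projection head (an illustrative stand-in): 'self_head' for
    same-architecture distillation, 'cross_head' for distillation
    across the two backbones."""
    def __init__(self, emb_dim=192, out_dim=4096):
        super().__init__()
        self.self_head = torch.nn.Linear(emb_dim, out_dim)
        self.cross_head = torch.nn.Linear(emb_dim, out_dim)

def memo_loss(s_a, s_b, t_a, t_b, sh_a, sh_b, th_a, th_b, contrastive=True):
    # s_a/s_b: student embeddings from backbones A and B;
    # t_a/t_b: teacher embeddings (in practice, EMA copies of the students);
    # sh_*/th_*: student/teacher Heads for each backbone.
    # Self-distillation: teacher and student share an architecture.
    loss = (dino_loss(sh_a.self_head(s_a), th_a.self_head(t_a)) +
            dino_loss(sh_b.self_head(s_b), th_b.self_head(t_b)))
    # Cross-distillation: each backbone's teacher guides the other
    # backbone's student through the dedicated cross-heads.
    loss += (dino_loss(sh_a.cross_head(s_a), th_b.cross_head(t_b)) +
             dino_loss(sh_b.cross_head(s_b), th_a.cross_head(t_a)))
    # Embedding-level contrastive term (an InfoNCE form is assumed),
    # dynamically dropped after early training, per the abstract.
    if contrastive:
        sim = F.normalize(s_a, dim=-1) @ F.normalize(t_a.detach(), dim=-1).t()
        loss += F.cross_entropy(sim / 0.07, torch.arange(s_a.size(0)))
    return loss
```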
KW - cross-distillation
KW - DINO
KW - knowledge distillation
KW - self-supervised learning
KW - speaker verification
UR - https://www.scopus.com/pages/publications/85214827410
U2 - 10.21437/Interspeech.2024-360
DO - 10.21437/Interspeech.2024-360
M3 - Conference article published in proceeding or book
AN - SCOPUS:85214827410
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4723
EP - 4727
LA - English
T2 - 25th Interspeech Conference 2024
Y2 - 1 September 2024 through 5 September 2024
ER -