TY - GEN
T1 - Two-stage Semi-supervised Speaker Recognition with Gated Label Learning
AU - Wang, Xingmei
AU - Meng, Jiaxiang
AU - Lee, Kong Aik
AU - Li, Boquan
AU - Liu, Jinghan
N1 - Publisher Copyright: © 2024 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2024/8
Y1 - 2024/8
AB - Speaker recognition technologies have been successfully applied in diverse domains, benefiting from advances in deep learning. Nevertheless, current efforts are still limited by the lack of labeled data. Such issues have been addressed in computer vision through semi-supervised learning (SSL), which assigns pseudo labels to unlabeled data so that they can serve in place of labeled data. In our empirical evaluations, state-of-the-art SSL methods show unsatisfactory performance in speaker recognition tasks, due to the imbalance between the quantity and quality of pseudo labels. Therefore, in this work, we propose a two-stage SSL framework that aims to address the data scarcity challenge. We first construct an initial contrastive learning network, where the encoder outputs the embedding representation of utterances. We then construct an iterative holistic semi-supervised learning network that involves a clustering strategy to assign pseudo labels and a gated label learning (GLL) strategy to further select reliable pseudo-label data. Systematic evaluations show that our proposed framework outperforms state-of-the-art methods in speaker recognition, matching the performance of supervised learning.
UR - http://www.scopus.com/inward/record.url?scp=85204295782&partnerID=8YFLogxK
U2 - 10.24963/ijcai.2024/718
DO - 10.24963/ijcai.2024/718
M3 - Conference article published in proceeding or book
AN - SCOPUS:85204295782
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 6495
EP - 6503
BT - Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
A2 - Larson, Kate
PB - International Joint Conferences on Artificial Intelligence
T2 - 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
Y2 - 3 August 2024 through 9 August 2024
ER -