TY - GEN
T1 - Speaker Representation Learning via Contrastive Loss with Maximal Speaker Separability
AU - Li, Zhe
AU - Mak, Man Wai
N1 - Funding Information:
This work was in part supported by Research Grants Council of Hong Kong, Theme-based Research Scheme (Ref.: T45-407/19-N).
Publisher Copyright:
© 2022 Asia-Pacific Signal and Information Processing Association (APSIPA).
PY - 2022/11
Y1 - 2022/11
N2 - A great challenge in speaker representation learning using deep models is to design learning objectives that can enhance the discrimination of unseen speakers under unseen domains. This work proposes a supervised contrastive learning objective to learn a speaker embedding space by effectively leveraging the label information in the training data. In such a space, utterance pairs spoken by the same or similar speakers will stay close, while utterance pairs spoken by different speakers will be far apart. For each training speaker, we perform random data augmentation on their utterances to form positive pairs, and utterances from different speakers form negative pairs. To maximize speaker separability in the embedding space, we incorporate the additive angular-margin loss into the contrastive learning objective. Experimental results on CN-Celeb show that this new learning objective enables ECAPA-TDNN to learn an embedding space with strong speaker discrimination. The contrastive learning objective is easy to implement, and we provide PyTorch code at https://github.com/shanmon110/AAMSupCon.
AB - A great challenge in speaker representation learning using deep models is to design learning objectives that can enhance the discrimination of unseen speakers under unseen domains. This work proposes a supervised contrastive learning objective to learn a speaker embedding space by effectively leveraging the label information in the training data. In such a space, utterance pairs spoken by the same or similar speakers will stay close, while utterance pairs spoken by different speakers will be far apart. For each training speaker, we perform random data augmentation on their utterances to form positive pairs, and utterances from different speakers form negative pairs. To maximize speaker separability in the embedding space, we incorporate the additive angular-margin loss into the contrastive learning objective. Experimental results on CN-Celeb show that this new learning objective enables ECAPA-TDNN to learn an embedding space with strong speaker discrimination. The contrastive learning objective is easy to implement, and we provide PyTorch code at https://github.com/shanmon110/AAMSupCon.
UR - http://www.scopus.com/inward/record.url?scp=85146283223&partnerID=8YFLogxK
U2 - 10.23919/APSIPAASC55919.2022.9980014
DO - 10.23919/APSIPAASC55919.2022.9980014
M3 - Conference article published in proceeding or book
AN - SCOPUS:85146283223
T3 - Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
SP - 962
EP - 967
BT - Proceedings of 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2022
Y2 - 7 November 2022 through 10 November 2022
ER -
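
Note: the abstract above describes a supervised contrastive objective with an additive angular margin on positive pairs. The following is a minimal PyTorch sketch of that idea, not the authors' released implementation (see https://github.com/shanmon110/AAMSupCon for the official code). The function name, temperature and margin values, and batch layout (two augmented views per utterance stacked along the batch dimension) are illustrative assumptions.

    # Minimal sketch: SupCon-style loss with an additive angular margin on positive pairs.
    import torch
    import torch.nn.functional as F

    def aam_supcon_loss(embeddings: torch.Tensor,
                        labels: torch.Tensor,
                        temperature: float = 0.07,
                        margin: float = 0.2) -> torch.Tensor:
        """embeddings: (N, D) speaker embeddings; labels: (N,) integer speaker IDs."""
        z = F.normalize(embeddings, dim=1)                 # unit-length embeddings
        cos = (z @ z.t()).clamp(-1 + 1e-7, 1 - 1e-7)       # pairwise cosine similarities

        # Additive angular margin: replace cos(theta) by cos(theta + m) for positive pairs.
        cos_margin = torch.cos(torch.acos(cos) + margin)

        same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (N, N) same-speaker mask
        eye = torch.eye(len(labels), dtype=torch.bool, device=z.device)
        pos_mask = same & ~eye                             # positives, excluding self-pairs

        logits = torch.where(pos_mask, cos_margin, cos) / temperature
        logits = logits.masked_fill(eye, float('-inf'))    # drop self-similarity
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

        # Mean log-likelihood of positives per anchor, averaged over anchors with positives.
        masked_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0
        mean_log_prob_pos = masked_log_prob.sum(dim=1)[valid] / pos_counts[valid]
        return -mean_log_prob_pos.mean()

    if __name__ == "__main__":
        emb = torch.randn(8, 192)                          # e.g. ECAPA-TDNN-sized embeddings
        spk = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])       # two augmented views per speaker
        print(aam_supcon_loss(emb, spk).item())

In this sketch the margin is applied to positive-pair similarities both in the numerator and in the normalizing sum; the paper's exact formulation may differ, so consult the linked repository for the authoritative version.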