TY - GEN
T1 - Age-Invariant Speaker Embedding for Diarization of Cognitive Assessments
AU - Xu, Sean Shensheng
AU - Mak, Man Wai
AU - Wong, Ka Ho
AU - Meng, Helen
AU - Kwok, Timothy C.Y.
N1 - Funding Information:
This work was in part supported by Research Grants Council of Hong Kong, Theme-based Research Scheme (Ref.: T45-407/19-N).
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/24
Y1 - 2021/1/24
AB - This paper investigates an age-invariant speaker embedding approach to speaker diarization, which is an essential step towards the automatic cognitive assessments from speech. Studies have shown that incorporating speaker traits (e.g., age, gender, etc.) can improve speaker diarization performance. However, we found that age information in the speaker embeddings is detrimental to speaker diarization if there is a severe mismatch between the age distributions in the training data and test data. To minimize the detrimental effect of age mismatch, an adversarial training strategy is introduced to remove age variability from the utterance-level speaker embeddings. Evaluations on an interactive dialog dataset for Montreal cognitive assessments (MoCA) show that the adversarial training strategy can produce age-invariant embeddings and reduce diarization error rate (DER) by 4.33%. The approach also outperforms the conventional method even with less training data.
KW - age-invariant speaker embedding
KW - deep neural networks
KW - Montreal cognitive assessments
KW - speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85102583136&partnerID=8YFLogxK
U2 - 10.1109/ISCSLP49672.2021.9362084
DO - 10.1109/ISCSLP49672.2021.9362084
M3 - Conference article published in proceeding or book
AN - SCOPUS:85102583136
T3 - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
BT - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
CY - Hong Kong
T2 - 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
Y2 - 24 January 2021 through 27 January 2021
ER -