TY - GEN
T1 - Speaker Turn Aware Similarity Scoring for Diarization of Speech-Based Cognitive Assessments
AU - Xu, Sean Shensheng
AU - Mak, Man Wai
AU - Wong, Ka Ho
AU - Meng, Helen
AU - Kwok, Timothy C.Y.
N1 - Funding Information:
This work was supported in part by the Research Grants Council of Hong Kong, Theme-based Research Scheme (Ref.: T45-407/19-N).
Publisher Copyright:
© 2021 APSIPA.
PY - 2021/12
Y1 - 2021/12
AB - This paper proposes two enhancements to conventional speaker diarization methods for speech-based Montreal Cognitive Assessments (MoCA). The enhancements address the technical challenges of MoCA recordings on two fronts. First, a multi-scale channel-interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Evaluations on an interactive dialog dataset for MoCA show that the proposed enhancements lead to a diarization system that outperforms conventional x-vector/PLDA systems under language, age, and microphone mismatch scenarios. The results also show that the speaker-turn timestamps can be hypothesized, suggesting that the proposed enhancements are amenable to datasets without speaker timestamp information.
UR - http://www.scopus.com/inward/record.url?scp=85126646285&partnerID=8YFLogxK
M3 - Conference article published in proceeding or book
AN - SCOPUS:85126646285
T3 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
SP - 1299
EP - 1304
BT - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2021
Y2 - 14 December 2021 through 17 December 2021
ER -