TY - GEN
T1 - Channel Interdependence Enhanced Speaker Embeddings for Far-Field Speaker Verification
AU - Zhao, Ling Jun
AU - Mak, Man Wai
N1 - Funding Information:
This work was supported in part by Huawei Technologies Co., Ltd., Project No. YBN2019095008, and the National Natural Science Foundation of China (NSFC), Grant No. 61971371.
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/24
Y1 - 2021/1/24
N2 - Recognizing speakers from a distance using far-field microphones is difficult because of environmental noise and reverberation distortion. In this work, we tackle these problems by strengthening the frame-level processing and feature aggregation of x-vector networks. Specifically, we restructure the dilated convolutional layers into Res2Net blocks to generate multi-scale frame-level features. To exploit the relationship between the channels, we introduce squeeze-and-excitation (SE) units to rescale the channels' activations and investigate the best places to put these SE units in the Res2Net blocks. Based on the hypothesis that layers at different depths contain speaker information at different granularity levels, multi-block feature aggregation is introduced to propagate and aggregate the features at various depths. To optimally weight the channels and frames during feature aggregation, we propose a channel-dependent attention mechanism. Combining all of these enhancements leads to a network architecture called channel-interdependence enhanced Res2Net (CE-Res2Net). Results show that the proposed network achieves a relative improvement of about 16% in EER and 17% in minDCF on the VOiCES 2019 Challenge's evaluation set.
KW - channel-dependent attention
KW - Far-field speaker verification
KW - Res2Net
KW - speaker embedding
KW - Squeeze-and-excitation
UR - http://www.scopus.com/inward/record.url?scp=85102595695&partnerID=8YFLogxK
U2 - 10.1109/ISCSLP49672.2021.9362108
DO - 10.1109/ISCSLP49672.2021.9362108
M3 - Conference article published in proceeding or book
AN - SCOPUS:85102595695
T3 - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
BT - 2021 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
CY - Hong Kong
T2 - 12th International Symposium on Chinese Spoken Language Processing, ISCSLP 2021
Y2 - 24 January 2021 through 27 January 2021
ER -