TY - GEN
T1 - Joint feature enhancement and speaker recognition with multi-objective task-oriented network
AU - Wu, Yibo
AU - Wang, Longbiao
AU - Lee, Kong Aik
AU - Liu, Meng
AU - Dang, Jianwu
N1 - Publisher Copyright: Copyright © 2021 ISCA.
PY - 2021/9
Y1 - 2021/9
AB - Recently, increasing attention has been paid to the joint training of upstream and downstream tasks and to the challenge of synchronizing the various loss functions in a multi-objective scenario. In this paper, to address the competing gradient directions between the speaker classification loss and the feature enhancement loss, we propose an asynchronous subregion optimization approach for the joint training of feature enhancement and speaker embedding neural networks. For the asynchronous subregion optimization, the squeeze-and-excitation (SE) method is introduced in the enhancement network to adaptively select important channels for speaker embedding. Furthermore, channel-wise feature concatenation is applied between the input feature and the enhanced feature to address the distortion of speaker information caused by the enhancement loss. Using the proposed joint training network with asynchronous subregion optimization and channel-wise feature concatenation, we obtained relative gains of 11.95% and 6.43% in equal error rate on a noisy version of VoxCeleb1 and on the VOiCES corpus, respectively.
KW - Far-field speaker verification
KW - Feature enhancement
KW - Joint training
KW - Squeeze and excitation
UR - http://www.scopus.com/inward/record.url?scp=85119181222&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1978
DO - 10.21437/Interspeech.2021-1978
M3 - Conference article published in proceeding or book
AN - SCOPUS:85119181222
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1993
EP - 1997
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -