Attention Mechanism in Speaker Recognition: What Does it Learn in Deep Speaker Embedding?

Qiongqiong Wang, Koji Okabe, Kong Aik Lee, Hitoshi Yamamoto, Takafumi Koshinaka

Research output: Chapter in book / Conference proceeding › Conference article published in proceeding or book › Academic research › peer-review

20 Citations (Scopus)

Abstract

This paper presents an experimental study on deep speaker embedding with an attention mechanism, which has proven to be a powerful representation learning technique in speaker recognition. In this framework, the attention model acts as a frame selector: it computes an attention weight for each frame-level feature vector, and the weighted features are then aggregated into an utterance-level representation at the pooling layer of a speaker embedding network. In general, the attention model is trained together with the speaker embedding network on a single objective function, so the two components are tightly bound to one another. In this paper, we consider the possibility that the attention model might be decoupled from its parent network and assist other speaker embedding networks, and even conventional i-vector extractors. This possibility is demonstrated through a series of experiments on a NIST Speaker Recognition Evaluation (SRE) task, with 9.0% EER reduction and 3.8% minCprimary reduction when the attention weights are applied to i-vector extraction. A further experiment shows that DNN-based soft voice activity detection (VAD) can be effectively combined with the attention mechanism to yield additional minCprimary reductions of 6.6% and 1.6% in the deep speaker embedding and i-vector systems, respectively.
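The frame-level attention pooling described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the parameter names (`W`, `b`, `v`) and the single-head scoring function `v^T tanh(W^T h_t + b)` are assumptions based on common attention-pooling formulations. The attention model scores each frame, normalizes the scores with a softmax, and pools the frames into one utterance-level vector:

```python
import numpy as np

def attentive_pooling(frames, W, b, v):
    """Attention-weighted pooling of frame-level features.

    frames: (T, D) frame-level feature vectors for one utterance
    W: (D, A), b: (A,), v: (A,) -- hypothetical attention parameters
    Returns (mean, weights): the utterance-level vector and the
    per-frame attention weights.
    """
    # Scalar score per frame: v^T tanh(W^T h_t + b)
    scores = np.tanh(frames @ W + b) @ v          # shape (T,)
    # Softmax over frames -> non-negative weights summing to 1
    scores = scores - scores.max()                # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    # Weighted mean of frames = utterance-level representation
    mean = weights @ frames                       # shape (D,)
    return mean, weights

# Toy usage with random features and parameters
rng = np.random.default_rng(0)
T, D, A = 50, 8, 16
frames = rng.standard_normal((T, D))
W = rng.standard_normal((D, A))
b = rng.standard_normal(A)
v = rng.standard_normal(A)
mean, weights = attentive_pooling(frames, W, b, v)
```

Because the weights form a probability distribution over frames, the same weights can in principle be reused as frame-level occupancy-style weights in another pooling layer or in i-vector sufficient-statistics accumulation, which is the decoupling the paper investigates.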

Original language: English
Title of host publication: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1052-1059
Number of pages: 8
ISBN (Electronic): 9781538643341
DOIs
Publication status: Published - 2 Jul 2018
Externally published: Yes
Event: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Athens, Greece
Duration: 18 Dec 2018 to 21 Dec 2018

Publication series

Name: 2018 IEEE Spoken Language Technology Workshop, SLT 2018 - Proceedings

Conference

Conference: 2018 IEEE Spoken Language Technology Workshop, SLT 2018
Country/Territory: Greece
City: Athens
Period: 18/12/18 to 21/12/18

Keywords

  • attention
  • DNN
  • i-vector
  • speaker embedding
  • speaker recognition

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
  • Linguistics and Language
