Abstract
How to effectively convert a sequence of variable-length acoustic features to a fixed-dimension representation has always been a research focus in speaker recognition. In state-of-the-art speaker recognition systems, the conversion is implemented by concatenating the mean and the standard deviation of a sequence of frame-level features. However, a single mean and a single standard deviation are limited descriptive statistics for an acoustic sequence even with powerful feature extractors such as convolutional neural networks. In this paper, we propose a novel statistics pooling method that can produce more descriptive statistics through a mixture representation. Our approach is inspired by the expectation-maximization (EM) algorithm in Gaussian mixture models (GMMs). Instead of using traditional GMM style alignment, we novelly leverage modern deep learning tools to produce a more powerful mixture representation. The novelty includes: (1) unlike GMMs, the mixture assignments are determined by an attention network instead of the Euclidean distances between the frame-level features and explicit centers; (2) instead of using a single frame as input to the attention network, contextual frames are included to smooth out attention transition; and (3) soft-attention assignments are replaced by hard-attention assignments via the Gumbel-Softmax with straight-through estimators. With the proposed attention mechanism, we obtained a 13.7% relative improvement over vanilla mean and standard deviation pooling in the VOiCES19-eval set.
Original language | English |
---|---|
Article number | 9722992 |
Pages (from-to) | 968-978 |
Number of pages | 11 |
Journal | IEEE/ACM Transactions on Audio Speech and Language Processing |
Volume | 30 |
DOIs | |
Publication status | Published - Feb 2022 |
Keywords
- attention models
- deep neural networks
- Gumbel-Softmax
- Speaker recognition
- statistics pooling
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Acoustics and Ultrasonics
- Computational Mathematics
- Electrical and Electronic Engineering