Mixture Representation Learning for Deep Speaker Embedding

Research output: Journal article publication › Journal article › Academic research › peer-review

Abstract

How to effectively convert a variable-length sequence of acoustic features into a fixed-dimension representation has long been a research focus in speaker recognition. In state-of-the-art speaker recognition systems, the conversion is implemented by concatenating the mean and the standard deviation of a sequence of frame-level features. However, a single mean and a single standard deviation are limited descriptive statistics for an acoustic sequence, even with powerful feature extractors such as convolutional neural networks. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our approach is inspired by the expectation–maximization (EM) algorithm in Gaussian mixture models (GMMs). Instead of using traditional GMM-style alignment, we leverage modern deep learning tools to produce a more powerful mixture representation. The novelty is threefold: (1) unlike GMMs, the mixture assignments are determined by an attention network rather than by the Euclidean distances between the frame-level features and explicit centers; (2) instead of a single frame, contextual frames are fed to the attention network to smooth out attention transitions; and (3) soft-attention assignments are replaced by hard-attention assignments via the Gumbel-Softmax with straight-through estimators. With the proposed attention mechanism, we obtained a 13.7% relative improvement over vanilla mean and standard deviation pooling on the VOiCES19-eval set.
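The following is a minimal PyTorch sketch of the pooling ideas summarized in the abstract. The module name, the convolutional realization of the contextual attention network, the number of components, and the temperature are illustrative assumptions, not the authors' implementation; the paper itself should be consulted for the exact architecture.

```python
# Sketch of attention-based mixture statistics pooling (assumed details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureStatsPooling(nn.Module):
    """Pools frame-level features x of shape (B, T, D) into a fixed-dimension
    embedding of size 2 * C * D, where C is the number of mixture components.

    Per-frame component assignments come from an attention network over
    contextual frames; hard assignments are sampled with the Gumbel-Softmax
    and a straight-through estimator, as described in the abstract.
    """

    def __init__(self, feat_dim: int, num_components: int = 4,
                 context: int = 2, tau: float = 1.0):
        super().__init__()
        self.tau = tau  # Gumbel-Softmax temperature (assumed hyperparameter)
        # A 1-D convolution lets each frame attend over 2*context + 1
        # neighbouring frames, smoothing attention transitions.
        self.attention = nn.Conv1d(feat_dim, num_components,
                                   kernel_size=2 * context + 1,
                                   padding=context)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.attention(x.transpose(1, 2))         # (B, C, T)
        # hard=True gives one-hot assignments in the forward pass while
        # gradients flow through the soft sample (straight-through).
        a = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=1)
        a = a / a.sum(dim=2, keepdim=True).clamp_min(1e-6)  # weights over time
        # Component-wise weighted first- and second-order statistics.
        mean = torch.einsum('bct,btd->bcd', a, x)           # (B, C, D)
        sq = torch.einsum('bct,btd->bcd', a, x * x)
        std = (sq - mean * mean).clamp_min(1e-6).sqrt()
        return torch.cat([mean, std], dim=2).flatten(1)     # (B, 2*C*D)


# Example (hypothetical sizes): 256-dim features, 4 components -> 2048-dim.
pool = MixtureStatsPooling(feat_dim=256, num_components=4)
emb = pool(torch.randn(8, 200, 256))  # (8, 2048)
```

For comparison, the vanilla baseline the abstract improves upon reduces to `torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)`, i.e. a single mean and standard deviation over all frames.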
Original language: Others/Unknown
Pages (from-to): 968-978
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 30
DOIs
Publication status: Published - Feb 2022