Mixture Representation Learning for Deep Speaker Embedding

Weiwei Lin, Man Wai Mak

Research output: Journal article publication › Journal article › Academic research › peer-review

11 Citations (Scopus)

Abstract

How to effectively convert a variable-length sequence of acoustic features into a fixed-dimension representation has long been a research focus in speaker recognition. In state-of-the-art speaker recognition systems, the conversion is implemented by concatenating the mean and the standard deviation of a sequence of frame-level features. However, a single mean and a single standard deviation are limited descriptive statistics for an acoustic sequence, even with powerful feature extractors such as convolutional neural networks. In this paper, we propose a novel statistics pooling method that produces more descriptive statistics through a mixture representation. Our approach is inspired by the expectation-maximization (EM) algorithm in Gaussian mixture models (GMMs). Instead of using traditional GMM-style alignment, we leverage modern deep learning tools to produce a more powerful mixture representation. The novelties are: (1) unlike in GMMs, the mixture assignments are determined by an attention network instead of by the Euclidean distances between the frame-level features and explicit centers; (2) instead of using a single frame as input to the attention network, contextual frames are included to smooth the attention transitions; and (3) soft-attention assignments are replaced by hard-attention assignments via the Gumbel-Softmax with straight-through estimators. With the proposed attention mechanism, we obtained a 13.7% relative improvement over vanilla mean and standard deviation pooling on the VOiCES19-eval set.
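The pooling ideas in the abstract can be illustrated with a small forward-pass sketch in plain Python. It is not the authors' implementation: the abstract does not specify the attention network or the per-component statistics, so `context_logits` (a linear scorer over averaged neighboring frames), the weight matrix, the per-component mean and standard deviation, and the zero-filling of empty components are all assumptions made for illustration. The straight-through estimator only affects gradients, so it does not appear in this forward-only sketch.

```python
import math
import random


def mean_std_pool(frames):
    """Baseline pooling: concatenate per-dimension mean and standard deviation."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(dim)]
    return mean + std


def context_logits(frames, weights, context=1):
    """Hypothetical stand-in for the attention network: a linear scorer applied
    to the average of each frame and its neighbors (the contextual frames)."""
    dim = len(frames[0])
    logits = []
    for t in range(len(frames)):
        window = frames[max(0, t - context): t + context + 1]
        avg = [sum(f[d] for f in window) / len(window) for d in range(dim)]
        logits.append([sum(w_d * x_d for w_d, x_d in zip(w, avg)) for w in weights])
    return logits


def gumbel_hard_assign(logits, rng, tau=1.0):
    """Forward pass of a hard Gumbel-Softmax sample: add Gumbel noise to the
    logits and return a one-hot vector at the argmax."""
    noisy = []
    for value in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((value + gumbel) / tau)
    k = max(range(len(noisy)), key=noisy.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(logits))]


def mixture_pool(frames, assignments):
    """Concatenate mean/std statistics of the frames assigned to each component."""
    dim = len(frames[0])
    pooled = []
    for c in range(len(assignments[0])):
        members = [f for f, a in zip(frames, assignments) if a[c] == 1.0]
        if members:
            pooled += mean_std_pool(members)
        else:
            pooled += [0.0] * (2 * dim)  # assumption: empty components give zeros
    return pooled


rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(20)]   # 20 frames, 4 dims
weights = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]   # 3 components
assignments = [gumbel_hard_assign(l, rng) for l in context_logits(frames, weights)]
baseline = mean_std_pool(frames)              # length 2 * 4
embedding = mixture_pool(frames, assignments)  # length 3 * 2 * 4
```

With three components, the mixture representation is three times as long as the baseline, since each component contributes its own mean and standard deviation. In a trainable model the one-hot assignment would be paired with a soft relaxation in the backward pass, which is the role of the straight-through estimator.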

Original language: English
Article number: 9722992
Pages (from-to): 968-978
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 30
Publication status: Published - Feb 2022

Keywords

  • attention models
  • deep neural networks
  • Gumbel-Softmax
  • speaker recognition
  • statistics pooling

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
