Deep Speaker Embedding for Robust Speaker Verification

Student thesis: PhD


Speaker verification (SV) aims to determine whether the speaker identity of a test utterance matches that of a target speaker. Although state-of-the-art deep speaker embedding has achieved outstanding performance, deploying SV systems in adverse acoustic environments still faces a number of challenges. First, today’s SV systems assume that the training and test data share the same distribution. When this assumption is violated, domain mismatch occurs, and the problem is exacerbated when the speaker embeddings violate the Gaussianity constraint. Second, because the temporal feature maps produced by the last frame-level layer are highly non-stationary, it is not desirable to use their global statistics as speaker embeddings. Third, current speaker embedding networks have no mechanism that lets the frame-level information flow directly into the embedding layer, causing information loss in the pooling layer.
This thesis develops three strategies to address the above challenges. First, to jointly address domain mismatch and the Gaussianity requirement of probabilistic linear discriminant analysis (PLDA) models, we propose a variational domain adversarial learning framework with two specialized networks: variational domain adversarial neural network (VDANN) and information-maximized VDANN (InfoVDANN). Both networks leverage domain adversarial training to produce speaker discriminative and domain-invariant embeddings and apply variational autoencoders (VAEs) to perform Gaussian regularization. The InfoVDANN, in particular, avoids posterior collapse in VDANNs by preserving the mutual information (MI) between the domain-invariant embeddings and the speaker embeddings. Second, to mitigate the effect of non-stationarity in the temporal feature maps, we propose short-time spectral pooling (STSP) and attentive STSP to transform the temporal feature maps into the spectral domain through short-time Fourier transform (STFT). The zero- and low-frequency components are retained to preserve speaker information. A segment-level attention mechanism is designed to produce attention weights with fewer variations, which results in better generalization and robustness to the non-stationary effect in the feature maps. Third, to allow information in the frame-level layers to flow directly to the speaker embedding layer, MI-enhanced training based on a semi-supervised deep InfoMax (DIM) framework is proposed. Because the dimensionality of the frame-level features is much larger than that of the speaker embeddings, we propose to squeeze the frame-level features via global pooling before MI estimation. The proposed method, called squeeze-DIM, effectively balances the dimension between the frame-level features and the speaker embeddings.
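The idea behind short-time spectral pooling can be illustrated with a minimal sketch. It is not the thesis's implementation: the rectangular window, the window and hop lengths, the number of retained bins, and the choice to average bin magnitudes across segments are all assumptions made for illustration, and the function names `short_time_dft` and `stsp_sketch` are hypothetical.

```python
import cmath
import math


def short_time_dft(signal, win_len, hop):
    """Split one channel into rectangular-window segments and DFT each one.

    Returns a list of segments; each segment is a list of complex bins
    for frequencies 0 .. win_len // 2.
    """
    if len(signal) < win_len:
        raise ValueError("signal is shorter than the analysis window")
    segments = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len]
        bins = [
            sum(x * cmath.exp(-2j * math.pi * k * n / win_len)
                for n, x in enumerate(frame))
            for k in range(win_len // 2 + 1)
        ]
        segments.append(bins)
    return segments


def stsp_sketch(feature_map, win_len=8, hop=4, num_low=2):
    """Pool a temporal feature map using only its lowest spectral bins.

    feature_map: list of channels, each a list of floats over time.
    Assumption: the magnitudes of the num_low lowest bins (including
    the zero-frequency bin) are averaged across segments and then
    concatenated over channels.
    """
    pooled = []
    for channel in feature_map:
        segments = short_time_dft(channel, win_len, hop)
        for k in range(num_low):
            magnitudes = [abs(segment[k]) for segment in segments]
            pooled.append(sum(magnitudes) / len(magnitudes))
    return pooled
```

Under these assumptions, a feature map with C channels yields a vector of length C × `num_low`. A constant channel contributes only to its zero-frequency entry, while a slowly oscillating channel contributes to a low-frequency entry, which is the kind of information that plain mean pooling would discard.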
We evaluate the proposed methods on VoxCeleb, VOiCES 2019, SRE16, and SRE18. Results show that the VDANN and InfoVDANN outperform the DANN baseline, indicating the effectiveness of Gaussian regularization and MI maximization. We also observe that attentive STSP achieves the largest performance gains, suggesting that applying segment-level attention and leveraging low spectral components of temporal feature maps can produce discriminative speaker embeddings. Finally, we demonstrate that the squeeze-DIM outperforms the DIM regularization, suggesting that the squeeze operation facilitates MI maximization.
Date of Award: Mar 2022
Original language: English
Awarding Institution
  • The Hong Kong Polytechnic University
Supervisor: Man Wai Mak (Chief supervisor)


Keywords
  • Speaker Recognition
  • Speaker Embedding
  • Deep Learning
  • Spectral Pooling
  • Deep Neural Networks
