Abstract

The rapid development of technology has driven society into a new era of AI, in which speaker recognition is one of the important techniques. Because voiceprints are unique to each person, speaker recognition has been used to enhance the security of banking and personal security systems. Despite the great convenience that speaker recognition technology provides, some key problems remain unsolved, including (1) insufficient labeled samples from new acoustic environments for training supervised machine learning models and (2) domain mismatch among different acoustic environments. These problems may cause severe performance degradation in speaker recognition systems.
We proposed two methods to address the above problems: (1) unsupervised domain adaptation for speaker verification and (2) a contrastive adversarial domain adaptation network for domain-invariant speaker identification. The first method addresses the data sparsity issue by applying spectral clustering to in-domain unlabeled data to obtain hypothesized speaker labels, which are then used to adapt an out-of-domain PLDA mixture model to the target domain. To further refine the target PLDA mixture model, spectral clustering is iteratively applied to the new PLDA score matrix to produce a new set of hypothesized speaker labels. Moreover, a gender-aware deep neural network (DNN) is trained to produce gender posteriors given an i-vector. The gender posteriors then replace the posterior probabilities of the indicator variables in the PLDA mixture model. Gender-dependent inter-dataset variability compensation (GD-IDVC) is implemented to reduce the mismatch between the i-vectors obtained from the in-domain and out-of-domain datasets. Evaluations based on NIST 2016 SRE show that at the end of the iterative re-training, the PLDA mixture model becomes fully adapted to the new domain. Results also show that the PLDA scores can be readily incorporated into spectral clustering, resulting in high-quality speaker clusters that could not be achieved by agglomerative hierarchical clustering.
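The clustering step of the first method can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes NumPy is available, the function name `spectral_cluster` is hypothetical, and the mapping from pairwise scores to non-negative affinities (a simple shift by the minimum score) is an assumption made only for illustration. In the setting described above, `scores` would be the PLDA score matrix of the unlabeled in-domain utterances, and the returned labels would serve as hypothesized speaker labels for adapting the model before rescoring and clustering again.

```python
import numpy as np

def spectral_cluster(scores, n_clusters, n_iter=50):
    """Cluster items from a pairwise score matrix (higher score = more similar)."""
    s = np.asarray(scores, dtype=float)
    s = 0.5 * (s + s.T)                      # enforce symmetry
    a = s - s.min()                          # assumed mapping: shift scores to be non-negative
    np.fill_diagonal(a, 0.0)                 # ignore self-similarity
    d = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Normalized affinity D^{-1/2} A D^{-1/2}; its leading eigenvectors
    # give a low-dimensional embedding in which clusters separate.
    m = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(m)              # eigenvalues in ascending order
    emb = vecs[:, -n_clusters:]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

    # Deterministic farthest-point initialization, then plain k-means.
    centers = [emb[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)
    labels = np.zeros(len(emb), dtype=int)
    for _ in range(n_iter):
        sq_dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels

# Toy score matrix: two hypothetical speakers with three utterances each.
toy_scores = np.full((6, 6), -5.0)
toy_scores[:3, :3] = 5.0
toy_scores[3:, 3:] = 5.0
toy_labels = spectral_cluster(toy_scores, n_clusters=2)
```

The number of clusters is passed in explicitly here; how the number of speakers is chosen in practice is not stated in the abstract.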
The second method aims to reduce the mismatch between male and female speakers through adversarial domain adaptation. The method mitigates an intrinsic drawback of the domain adversarial network (DAN) by splitting the feature extractor into two contrastive branches, with one branch responsible for class dependence in the latent space and the other focusing on domain invariance. The feature extractor achieves these contrastive goals by sharing the first and last hidden layers while decoupling the branches in the middle hidden layers. To encourage the feature extractor to produce class-discriminative embedded features, the label predictor is adversarially trained to produce equal posterior probabilities across all of its outputs instead of producing one-hot outputs. We refer to the resulting domain adaptation network as the contrastive adversarial domain adaptation network (CADAN). We evaluated the domain invariance of the embedded features via a series of speaker identification experiments under both clean and noisy conditions. Results demonstrate that the embedded features produced by CADAN lead to improvements of 8.9% and 77.6% in speaker identification accuracy over the conventional DAN under clean and noisy conditions, respectively.
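One way to express the "equal posterior probabilities" objective is as a cross-entropy between a uniform target and the predictor's softmax output. The sketch below is an assumption for illustration only: the abstract does not give the loss actually used in CADAN, and the function names are hypothetical. The loss reaches its minimum, log K for K outputs, exactly when all posteriors are equal.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def uniform_target_loss(logits):
    """Cross-entropy between a uniform target and softmax(logits).

    Assumed formulation: minimizing this pushes the posteriors toward 1/K each.
    """
    log_probs = log_softmax(logits)
    return -sum(log_probs) / len(log_probs)

flat_loss = uniform_target_loss([0.0, 0.0, 0.0, 0.0])    # equal posteriors
peaked_loss = uniform_target_loss([5.0, 0.0, 0.0, 0.0])  # near one-hot posteriors
```

Here `flat_loss` equals log 4, the lowest value the loss can take for four outputs, while the near one-hot case gives a larger value.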
|Date of Award|2020|
|Supervisor|Man Wai Mak (Chief supervisor)|