Model-Agnostic Meta-Learning for Fast Text-Dependent Speaker Embedding Adaptation

Research output: Journal article publication, Journal article, Academic research, peer-review

6 Citations (Scopus)

Abstract

By constraining the lexical content of input speech, text-dependent speaker verification (TD-SV) offers more reliable performance than text-independent speaker verification (TI-SV) on short utterances. Because speech with constrained lexical content is harder to collect, TD models are often fine-tuned from a TI model using a small target-phrase dataset. Sometimes, however, the target-phrase dataset is too small for fine-tuning, which is the main obstacle to deploying TD-SV. One solution is to fine-tune the model on medium-size multi-phrase TD data and then deploy it on the target phrase. Although this strategy helps in some cases, performance remains sub-optimal because the model is not optimized for the target phrase. Inspired by recent progress in meta-learning, we propose a three-stage pipeline for adapting a TI model to a TD model for the target phrase. First, a TI model is trained on a large amount of speech data. Then, a multi-phrase TD dataset is used to tune the TI model via model-agnostic meta-learning (MAML). Finally, fast adaptation is performed using a small target-phrase dataset. Results show that the three-stage pipeline consistently outperforms both multi-phrase fine-tuning and target-phrase fine-tuning.
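
The three stages described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: it replaces the speaker-embedding network with a hypothetical one-parameter model y = w * x, treats each "phrase" as a regression task with its own slope, and uses a fixed starting value as a stand-in for the pre-trained TI model. All function names, learning rates, and task definitions are assumptions made for illustration. Because the model is a single scalar, the second-order MAML gradient can be written out by hand.

```python
import random

def grad_and_curvature(w, xs, ys):
    """Gradient and second derivative of the mean squared error of y = w * x."""
    n = len(xs)
    grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    curv = sum(2.0 * x * x for x in xs) / n
    return grad, curv

def mse(w, xs, ys):
    """Mean squared error of the scalar model on one dataset."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample_task(rng, slope, n):
    """Draw n noiseless samples from a toy task (a stand-in for one phrase)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return xs, [slope * x for x in xs]

def adapt(w, xs, ys, inner_lr, steps=1):
    """Stage 3: fast adaptation with a few gradient steps on a small dataset."""
    for _ in range(steps):
        grad, _ = grad_and_curvature(w, xs, ys)
        w -= inner_lr * grad
    return w

def maml_train(w, slopes, rng, inner_lr=0.1, outer_lr=0.05, iters=500, shots=5):
    """Stage 2: MAML over several tasks (stand-ins for multi-phrase TD data)."""
    for _ in range(iters):
        meta_grad = 0.0
        for slope in slopes:
            sx, sy = sample_task(rng, slope, shots)  # support set
            qx, qy = sample_task(rng, slope, shots)  # query set
            g_s, h_s = grad_and_curvature(w, sx, sy)
            w_fast = w - inner_lr * g_s              # inner-loop step
            g_q, _ = grad_and_curvature(w_fast, qx, qy)
            # Chain rule through the inner step: d(w_fast)/dw = 1 - inner_lr * h_s
            meta_grad += g_q * (1.0 - inner_lr * h_s)
        w -= outer_lr * meta_grad / len(slopes)      # outer-loop (meta) update
    return w

rng = random.Random(0)
w_pretrained = 0.0                  # Stage 1 stand-in: assumed pre-trained weight
w_meta = maml_train(w_pretrained, slopes=[1.0, 2.0, 3.0], rng=rng)

target_slope = 2.5                  # unseen "target phrase" with only 5 samples
tx, ty = sample_task(rng, target_slope, 5)
w_target = adapt(w_meta, tx, ty, inner_lr=0.1, steps=3)
w_direct = adapt(w_pretrained, tx, ty, inner_lr=0.1, steps=3)

print("error after adapting from the meta-learned start:", mse(w_target, tx, ty))
print("error after adapting from the pre-trained start: ", mse(w_direct, tx, ty))
```

In this toy setting the meta-learned starting point sits between the training slopes, so the same three adaptation steps end closer to the target than when starting from the pre-trained value. With a real embedding network the hand-derived chain-rule factor would be replaced by automatic differentiation through the inner step.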

Original language: English
Article number: 10122584
Pages (from-to): 1866-1876
Number of pages: 11
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 31
Publication status: Published - 2023

Keywords

  • Deep speaker embedding
  • MAML
  • meta-learning
  • model adaptation
  • text-dependent speaker verification

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
