Phone-centric local variability vector for text-constrained speaker verification

Liping Chen, Kong Aik Lee, Bin Ma, Wu Guo, Haizhou Li, Li Rong Dai

Research output: Journal article publication › Conference article › Academic research › peer-review

13 Citations (Scopus)

Abstract

This paper investigates the use of frame alignment given by a deep neural network (DNN) for the text-constrained speaker verification task, where the lexical content of the test utterances is limited to a finite vocabulary. The DNN makes use of information carried by the target frame and its context to assign the frame probabilistically to one of the phonetic states. The frame alignment is therefore more precise and less ambiguous than that generated by a Gaussian mixture model (GMM). Using the DNN alignment, we show that an i-vector can be decomposed into segments of local variability vectors, each corresponding to a monophone, where each local vector models session variability given the phonetic context. Based on the local vectors, content matching between the utterances under comparison can be accomplished within PLDA scoring. Experiments conducted on the RSR2015 database show that the proposed phone-centric local variability vector achieves better performance than the i-vector.
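The sketch below illustrates the general idea in the abstract: DNN frame posteriors are collapsed from senones to monophones, Baum-Welch statistics are accumulated per monophone, and a small local variability vector is extracted for each phone from its own subspace. It is a minimal NumPy illustration under assumed shapes and a random stand-in model (the dimensions, the senone-to-phone map, and the per-phone matrices T_local are all hypothetical), not the authors' actual configuration or recipe.

```python
# Minimal sketch: phone-centric local variability vectors from DNN posteriors.
# All sizes, names, and the random "model" are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

F = 39            # acoustic feature dimension (assumed)
T = 200           # number of frames in an utterance (assumed)
S = 120           # number of DNN senones (assumed)
P = 40            # number of monophones (assumed)
R = 50            # rank of each phone-local variability subspace (assumed)

# Dummy inputs standing in for a real front end and DNN.
features = rng.standard_normal((T, F))            # frame-level features
senone_post = rng.dirichlet(np.ones(S), size=T)   # DNN senone posteriors per frame
senone_to_phone = rng.integers(0, P, size=S)      # senone -> monophone map (assumed)

# Phone-dependent means and per-phone local variability matrices (assumed).
phone_means = rng.standard_normal((P, F))
T_local = rng.standard_normal((P, F, R)) * 0.1

# 1) Collapse senone posteriors to monophone occupancies per frame.
phone_post = np.zeros((T, P))
for s in range(S):
    phone_post[:, senone_to_phone[s]] += senone_post[:, s]

# 2) Accumulate zeroth- and centered first-order statistics per monophone.
N = phone_post.sum(axis=0)                                  # (P,)
Fc = phone_post.T @ features - N[:, None] * phone_means     # (P, F)

# 3) Extract a local variability vector per phone (standard i-vector style
#    MAP point estimate, assuming identity residual covariance for brevity).
local_vectors = []
for p in range(P):
    Tp = T_local[p]                                         # (F, R)
    precision = np.eye(R) + N[p] * Tp.T @ Tp
    w_p = np.linalg.solve(precision, Tp.T @ Fc[p])          # (R,)
    local_vectors.append(w_p)

# Concatenating the per-phone local vectors gives a structured representation
# whose segments can be matched by lexical content before PLDA scoring.
utterance_rep = np.concatenate(local_vectors)               # (P * R,)
print(utterance_rep.shape)
```

In this reading, content matching amounts to comparing only the segments of utterance_rep whose phones actually occur in both utterances; how the paper selects and weights those segments within PLDA is not reproduced here.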

Original language: English
Pages (from-to): 229-233
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2015-January
Publication status: Published - Sept 2015
Externally published: Yes
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: 6 Sept 2015 - 10 Sept 2015

Keywords

  • Deep neural network
  • Phone-centric local variability
  • Text-constrained speaker verification

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
