Wav2Spk: A simple DNN architecture for learning speaker embeddings from waveforms

Weiwei Lin, Man Wai Mak

Research output: Journal article publication, Conference article, Academic research, peer-review


Speaker recognition has seen impressive advances with the advent of deep neural networks (DNNs). However, state-of-the-art speaker recognition systems still rely on human-engineered features such as mel-frequency cepstral coefficients (MFCCs). We believe that handcrafted features limit the potential of the powerful representations that DNNs can learn. In addition, further steps such as voice activity detection (VAD) and cepstral mean and variance normalization (CMVN) are applied after the MFCCs are computed. In this paper, we show that MFCC, VAD, and CMVN can be replaced by tools available in standard deep learning toolboxes, such as a stack of strided convolutions, temporal gating, and instance normalization. With these tools, we show that directly learning speaker embeddings from waveforms outperforms an x-vector network that uses MFCC or filter-bank output as features. We achieve an EER of 1.95% on the VoxCeleb1 test set using an end-to-end training scheme, which, to the best of our knowledge, is the best performance reported using raw waveforms. Moreover, the proposed method is complementary to x-vector systems. The fusion of the proposed method with x-vectors trained on filter-bank features produces an EER of 1.55%.
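The three replacements named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the layer sizes, kernel widths, strides, random weights, and the function names (`strided_conv1d`, `temporal_gate`, `instance_norm`, `encode`) are all assumptions made for illustration, and the ordering of the stages simply follows the order in which the abstract lists them. A trained system would learn the weights and would typically use a deep learning toolbox rather than plain Python.

```python
import math
import random


def strided_conv1d(x, kernels, stride):
    """Strided 1-D convolution followed by ReLU (stands in for MFCC extraction).

    x: list of input channels, each a list of floats over time.
    kernels: weights indexed as [out_channel][in_channel][tap].
    """
    k = len(kernels[0][0])
    n_frames = (len(x[0]) - k) // stride + 1
    out = []
    for w_out in kernels:
        row = []
        for t in range(n_frames):
            start = t * stride
            acc = sum(w[j] * x[c][start + j]
                      for c, w in enumerate(w_out) for j in range(k))
            row.append(max(acc, 0.0))
        out.append(row)
    return out


def temporal_gate(feats, gate_w, gate_b):
    """One sigmoid gate per frame scales all channels (a soft stand-in for VAD)."""
    n_frames = len(feats[0])
    gates = []
    for t in range(n_frames):
        z = gate_b + sum(gate_w[c] * feats[c][t] for c in range(len(feats)))
        gates.append(1.0 / (1.0 + math.exp(-z)))
    gated = [[row[t] * gates[t] for t in range(n_frames)] for row in feats]
    return gated, gates


def instance_norm(feats, eps=1e-5):
    """Per-utterance, per-channel mean/variance normalization over time (stands in for CMVN)."""
    normed = []
    for row in feats:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        normed.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return normed


def random_kernels(rng, out_ch, in_ch, k):
    """Hypothetical untrained weights; a real model would learn these."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(k)]
             for _ in range(in_ch)] for _ in range(out_ch)]


def encode(waveform, seed=0):
    """Waveform -> frame-level features: conv stack, then gating, then normalization."""
    rng = random.Random(seed)
    x = [list(waveform)]  # a single input channel of raw samples
    # Assumed toy stack of (out_channels, kernel_width, stride) layers.
    for out_ch, k, stride in [(4, 10, 5), (4, 4, 2)]:
        x = strided_conv1d(x, random_kernels(rng, out_ch, len(x), k), stride)
    gate_w = [rng.uniform(-0.5, 0.5) for _ in x]
    x, gates = temporal_gate(x, gate_w, 0.0)
    return instance_norm(x), gates


rng = random.Random(1)
waveform = [rng.uniform(-1.0, 1.0) for _ in range(400)]  # synthetic samples
features, gates = encode(waveform)
print(len(features), len(features[0]))  # channels, frames
```

In this sketch the two strided layers reduce the time resolution by a combined factor of 10, much as framing does in MFCC extraction, the gate down-weights frames instead of discarding them as a hard VAD decision would, and the normalization uses statistics of the single utterance only.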

Original language: English
Pages (from-to): 3211-3215
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - Oct 2020
Event: 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020 - Shanghai, China
Duration: 25 Oct 2020 - 29 Oct 2020


Keywords

  • Deep neural networks
  • End-to-end speaker embedding
  • Speaker verification
  • Temporal gating

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
