Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

Research output: Journal article publication, Journal article, Academic research, peer-review

10 Citations (Scopus)

Abstract

Residual neural networks (ResNets) demonstrate impressive performance in automatic speaker verification (ASV). They treat the time and frequency dimensions equally, following the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. We address this issue and postulate the Golden-Gemini Hypothesis, which posits that temporal resolution should be prioritized over frequency resolution for ASV. The hypothesis is verified through a systematic study of the impact of temporal and frequency resolutions on performance, using a trellis diagram to represent the stride space. We further identify two optimal points, namely Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based ASV models. By following the principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on the VoxCeleb, SITW, and CNCeleb datasets, with average EER/minDCF reductions of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
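To make the notion of a stride configuration concrete, the following minimal sketch computes the temporal and frequency resolution that remains after a stack of strided stages. It assumes 3x3 convolutions with padding 1, so each stage maps a size n to ceil(n / stride). The function name, the input size (200 frames by 80 frequency bins), and both stride tuples are illustrative assumptions; in particular, the asymmetric configuration is a hypothetical example and not the Golden Gemini operating points, which the abstract does not specify.

```python
from math import ceil


def output_resolution(n_frames, n_bins, time_strides, freq_strides):
    """Return (time, frequency) sizes after applying per-stage strides.

    Assumes each stage uses a 3x3 convolution with padding 1, for which
    the output size along an axis is ceil(size / stride).
    """
    if len(time_strides) != len(freq_strides):
        raise ValueError("need one time stride and one frequency stride per stage")
    t, f = n_frames, n_bins
    for stride_t, stride_f in zip(time_strides, freq_strides):
        t = ceil(t / stride_t)
        f = ceil(f / stride_f)
    return t, f


# Image-style default: identical strides on both axes (assumed 4-stage layout).
symmetric = output_resolution(200, 80, (1, 2, 2, 2), (1, 2, 2, 2))

# Hypothetical asymmetric layout that keeps more temporal resolution.
# Illustrative only; NOT the operating points reported in the paper.
asymmetric = output_resolution(200, 80, (1, 2, 2, 1), (1, 2, 2, 2))

print("symmetric :", symmetric)
print("asymmetric:", asymmetric)
```

In this sketch both layouts reduce the frequency axis by the same factor, while the asymmetric one retains twice as many time steps, which is the kind of trade-off the stride space in the abstract describes.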

Original language: English
Article number: 10497864
Pages (from-to): 2324-2337
Number of pages: 14
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume: 32
DOIs
Publication status: Published - Apr 2024

Keywords

  • 2D CNN
  • Computer architecture
  • Convolutional neural networks
  • Image resolution
  • Neural networks
  • ResNet
  • speaker recognition
  • Speaker verification
  • stride configuration
  • Task analysis
  • temporal resolution
  • Time-frequency analysis
  • Transformers

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
