Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection

  • Duc Tuan Truong
  • Ruijie Tao
  • Tuan Nguyen
  • Hieu Thi Luong
  • Kong Aik Lee
  • Eng Siong Chng

Research output: Chapter in book / Conference proceeding > Conference article published in proceeding or book > Academic research > peer-review

15 Citations (Scopus)

Abstract

Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to their convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we propose a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. A further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.

Original language: English
Title of host publication: English
Pages: 537-541
Number of pages: 5
DOIs
Publication status: Published - Sept 2024
Event: 25th Interspeech Conference 2024 - Kos Island, Greece
Duration: 1 Sept 2024 to 5 Sept 2024

Publication series

Name: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publisher: International Speech Communication Association
ISSN (Print): 2308-457X

Conference

Conference: 25th Interspeech Conference 2024
Country/Territory: Greece
City: Kos Island
Period: 1/09/24 to 5/09/24

Keywords

  • ASVspoof challenges
  • attention learning
  • synthetic speech detection

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
