Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Weiwei Lin, Chenhang He, Man Wai Mak, Youzhi Tu

Research output: Chapter in book / Conference proceedingConference article published in proceeding or bookAcademic researchpeer-review

1 Citation (Scopus)

Abstract

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.

Original languageEnglish
Title of host publicationThe International Conference on Machine Learning (ICML)
Pages21065-21077
Number of pages13
Volume202
Publication statusPublished - Jul 2023
Event40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: 23 Jul 202329 Jul 2023

Publication series

NameProceedings of Machine Learning Research
ISSN (Print)2640-3498

Conference

Conference40th International Conference on Machine Learning, ICML 2023
Country/TerritoryUnited States
CityHonolulu
Period23/07/2329/07/23

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Fingerprint

Dive into the research topics of 'Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations'. Together they form a unique fingerprint.

Cite this