Multimodal Affective Computing with Dense Fusion Transformer for Inter- and Intra-modality Interactions

Huan Deng, Zhenguo Yang, Tianyong Hao, Qing Li, Wenyin Liu

Research output: Journal article (Academic research, peer-reviewed)

11 Citations (Scopus)

Abstract

This paper proposes a dense fusion transformer (DFT) framework to integrate textual, acoustic, and visual information for multimodal affective computing. DFT exploits a modality-shared transformer (MT) module to extract the modality-shared features by modelling unimodal, bimodal, and trimodal interactions jointly. MT constructs a series of dense fusion blocks to fuse utterance-level sequential features of the multiple modalities from the perspectives of low-level and high-level semantics. In particular, MT adopts local and global transformers to learn modality-shared representations by modelling inter- and intra-modality interactions. Furthermore, we devise a modality-specific representation (MR) module with a soft orthogonality constraint to penalize the distance between modality-specific and modality-shared representations, which are fused by a transformer to make affective predictions. Extensive experiments conducted on five public benchmark datasets show that DFT outperforms the state-of-the-art baselines.
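The soft orthogonality constraint described above can be illustrated with a minimal sketch. The snippet below assumes a per-sample cosine formulation (row-wise dot products between modality-specific and modality-shared feature matrices, squared and summed); the paper may use a different variant, such as a Frobenius-norm penalty on the cross-correlation matrix, and the function name and shapes are hypothetical.

```python
import numpy as np

def soft_orthogonality_penalty(shared: np.ndarray, specific: np.ndarray) -> float:
    """Hypothetical sketch of a soft orthogonality loss.

    shared, specific: (batch, dim) feature matrices for the
    modality-shared and modality-specific representations.
    Minimizing the returned value pushes each sample's pair of
    vectors toward orthogonality.
    """
    # L2-normalize rows so the penalty is scale-invariant
    shared = shared / (np.linalg.norm(shared, axis=1, keepdims=True) + 1e-8)
    specific = specific / (np.linalg.norm(specific, axis=1, keepdims=True) + 1e-8)
    # Row-wise dot products, squared and summed over the batch
    dots = np.sum(shared * specific, axis=1)
    return float(np.sum(dots ** 2))

# Orthogonal vectors incur zero penalty; aligned vectors are penalized.
a = np.array([[1.0, 0.0]])
b_orth = np.array([[0.0, 1.0]])
b_same = np.array([[1.0, 0.0]])
print(soft_orthogonality_penalty(a, b_orth))  # near 0.0
print(soft_orthogonality_penalty(a, b_same))  # near 1.0
```

In training, a term like this would be added to the task loss with a weighting coefficient, so the modality-specific representations are discouraged from duplicating information already captured by the modality-shared ones.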

Original language: English
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Multimedia
DOIs
Publication status: Published - Sept 2022

Keywords

  • Affective computing
  • Computational modeling
  • Discrete Fourier transforms
  • Feature extraction
  • Fuses
  • multimodal emotion recognition
  • multimodal fusion
  • multimodal representation learning
  • multimodal sentiment analysis
  • Transformers
  • Visualization

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering
