Abstract
This paper proposes a dense fusion transformer (DFT) framework that integrates textual, acoustic, and visual information for multimodal affective computing. DFT employs a modality-shared transformer (MT) module to extract modality-shared features by jointly modelling unimodal, bimodal, and trimodal interactions. MT constructs a series of dense fusion blocks that fuse the utterance-level sequential features of the three modalities from the perspectives of both low-level and high-level semantics. In particular, MT adopts local and global transformers to learn modality-shared representations by modelling inter- and intra-modality interactions. Furthermore, we devise a modality-specific representation (MR) module with a soft orthogonality constraint that penalizes the distance between modality-specific and modality-shared representations; the two kinds of representations are then fused by a transformer to make affective predictions. Extensive experiments on five public benchmark datasets show that DFT outperforms state-of-the-art baselines.
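As a rough illustration of the components named in the abstract, the sketch below shows a shared/specific decomposition per modality, a soft orthogonality penalty, and a transformer fusion head in PyTorch. All names, feature sizes, and the squared-Frobenius form of the penalty are assumptions made for illustration; this is not the authors' DFT implementation, and the dense fusion blocks and local/global transformers are not reproduced here.

```python
# Illustrative sketch only (assumed shapes and loss form, not the authors' DFT code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def soft_orthogonality_penalty(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """One common soft orthogonality constraint: the squared Frobenius norm of
    shared^T @ specific, which discourages overlap between the two subspaces."""
    # shared, specific: (batch, seq_len, d_model)
    overlap = shared.transpose(-2, -1) @ specific          # (batch, d_model, d_model)
    return overlap.pow(2).sum()


class ModalityBranch(nn.Module):
    """Projects one modality's utterance-level sequence into a modality-shared
    and a modality-specific representation (sizes are placeholders)."""

    def __init__(self, in_dim: int, d_model: int = 128):
        super().__init__()
        self.shared_proj = nn.Linear(in_dim, d_model)
        self.specific_proj = nn.Linear(in_dim, d_model)

    def forward(self, x: torch.Tensor):
        return self.shared_proj(x), self.specific_proj(x)


class FusionHead(nn.Module):
    """Fuses the concatenated shared + specific tokens with a transformer
    encoder and regresses a single affective score."""

    def __init__(self, d_model: int = 128, num_outputs: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_outputs)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(tokens).mean(dim=1))


if __name__ == "__main__":
    batch, seq = 8, 20
    dims = {"text": 300, "audio": 74, "video": 35}          # placeholder feature sizes
    branches = nn.ModuleDict({m: ModalityBranch(d) for m, d in dims.items()})
    inputs = {m: torch.randn(batch, seq, d) for m, d in dims.items()}

    shared, specific, ortho = [], [], torch.tensor(0.0)
    for name, branch in branches.items():
        s, p = branch(inputs[name])
        shared.append(s)
        specific.append(p)
        ortho = ortho + soft_orthogonality_penalty(s, p)

    fusion = FusionHead()
    prediction = fusion(torch.cat(shared + specific, dim=1))  # (batch, 1)

    target = torch.randn(batch, 1)                            # dummy affect labels
    loss = F.mse_loss(prediction, target) + 1e-4 * ortho      # task loss + soft constraint
    print(prediction.shape, loss.item())
```

The penalty weight, pooling choice, and concatenation-based fusion are placeholder design choices; the paper's local/global transformers and dense fusion blocks would replace the simple encoder used here.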
Original language | English |
---|---|
Pages (from-to) | 1-13 |
Number of pages | 13 |
Journal | IEEE Transactions on Multimedia |
DOIs | |
Publication status | Published - Sept 2022 |
Keywords
- Affective computing
- Computational modeling
- Discrete Fourier transforms
- Feature extraction
- Fuses
- multimodal emotion recognition
- multimodal fusion
- multimodal representation learning
- multimodal sentiment analysis
- Transformers
- Visualization
ASJC Scopus subject areas
- Signal Processing
- Media Technology
- Computer Science Applications
- Electrical and Electronic Engineering