Abstract
Multimodal classification of social media assigns data from different modalities to categories, and is essential for understanding user behavior on the web. In this paper, we focus on classifying image-text pairs, specifically user-generated content posted on social media. Recently, the transformer network, a type of self-attention network, has been widely studied in visual computing and natural language processing. Standard attention mechanisms model only positive correlation. However, multimedia content posted on social media is diverse: images and text are not always consistent, and contradictory information can also aid representation. Detecting such conflicts through negative, or inverse, attention is therefore equally important. Inspired by the attention mechanism, we propose a novel model, namely the Crossmodal Bipolar Attention Network (CBAN). Unlike existing positive dot-product and additive attention mechanisms, our bipolar attention mechanism fuses visual and textual information through both their direct and their inverse semantic relationships to classify multimodal data. We conducted experiments on multiple multimodal classification datasets, covering sentiment analysis, sarcasm detection, crisis categorization, and hate-speech detection. Experimental results show that the proposed CBAN consistently outperforms state-of-the-art methods on all classification tasks.
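The abstract does not give the exact formulation of the bipolar attention. As an illustration only, the sketch below shows one plausible way a cross-modal layer could combine direct (positive) and inverse (negative) attention between text tokens and image regions; the module name, dimensions, the softmax over negated scores, and the fusion step are all assumptions, not the authors' formulation.

```python
# Illustrative sketch only: combining "direct" (positive) and "inverse" (negative)
# cross-modal attention, loosely following the idea described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BipolarCrossAttention(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(dim, hidden)   # queries from text tokens
        self.k_proj = nn.Linear(dim, hidden)   # keys from image regions
        self.v_proj = nn.Linear(dim, hidden)   # values from image regions
        self.fuse = nn.Linear(2 * hidden, hidden)
        self.scale = hidden ** -0.5

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T, dim), image_feats: (B, R, dim)
        q = self.q_proj(text_feats)
        k = self.k_proj(image_feats)
        v = self.v_proj(image_feats)

        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale   # (B, T, R)

        # Direct attention: emphasise image regions that agree with the text.
        pos_attn = F.softmax(scores, dim=-1)
        # Inverse attention: emphasise image regions that disagree with the text
        # (here simply a softmax over the negated scores; an assumption).
        neg_attn = F.softmax(-scores, dim=-1)

        pos_ctx = torch.matmul(pos_attn, v)    # (B, T, hidden)
        neg_ctx = torch.matmul(neg_attn, v)    # (B, T, hidden)

        # Fuse both views of the image for each text token.
        return self.fuse(torch.cat([pos_ctx, neg_ctx], dim=-1))


if __name__ == "__main__":
    text = torch.randn(2, 20, 512)    # 2 posts, 20 tokens, 512-d features
    image = torch.randn(2, 36, 512)   # 2 posts, 36 regions, 512-d features
    out = BipolarCrossAttention(dim=512)(text, image)
    print(out.shape)                  # torch.Size([2, 20, 256])
```

In this sketch the two attended contexts are simply concatenated and projected; the actual CBAN may fuse the direct and inverse branches differently.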
| | |
|---|---|
| Original language | English |
| Pages (from-to) | 1-12 |
| Number of pages | 12 |
| Journal | Neurocomputing |
| Volume | 514 |
| DOIs | |
| Publication status | Published - 1 Dec 2022 |
Keywords
- Attention mechanism
- Multimodal classification
- Social media mining
ASJC Scopus subject areas
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence