Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering

  • Ze Fu
  • Changmeng Zheng
  • Yi Cai
  • Qing Li
  • Tao Wang

Research output: Chapter in book / Conference proceeding; conference article published in proceeding or book; academic research; peer-reviewed

1 Citation (Scopus)

Abstract

Visual Question Answering (VQA) is a typical multimodal task with significant development prospects for web applications. To answer a question about a corresponding image, a VQA model must use information from different modalities efficiently. Although multimodal fusion methods such as the attention mechanism have contributed significantly to VQA, these methods attempt to co-learn multimodal features directly, ignoring the large gap between modalities and thus aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning, aiming to learn modality-invariant features for better semantic alignment and higher answer-prediction accuracy. The model achieves 70.81% accuracy on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
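The abstract (and the "Domain adaptation" keyword below) points to adversarial learning of modality-invariant features. The paper's exact architecture is not given here, so the following is only a minimal numpy sketch of the standard gradient-reversal idea used in adversarial domain/modality alignment: a discriminator tries to tell image features from question features, and the reversed gradient pushes the feature encoders to make the two modalities indistinguishable. All names, dimensions, and the one-layer discriminator are illustrative assumptions, not CMAN's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8  # illustrative feature dimension, not from the paper
img_feats = rng.normal(size=(4, d))   # stand-in image features
txt_feats = rng.normal(size=(4, d))   # stand-in question features
feats = np.vstack([img_feats, txt_feats])           # (8, d)
modality = np.array([0] * 4 + [1] * 4, dtype=float)  # 0 = image, 1 = text

# One-layer logistic modality discriminator (illustrative).
W = rng.normal(scale=0.1, size=(d,))
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_reverse(grad, lam=1.0):
    # Gradient reversal: identity on the forward pass, -lam * grad on
    # the backward pass, so the encoder is trained to FOOL the
    # discriminator while the discriminator is trained normally.
    return -lam * grad

p = sigmoid(feats @ W + b)            # discriminator's P(modality = text)
g_feats = np.outer(p - modality, W)   # dBCE/dfeatures (standard logistic gradient)
g_to_encoder = grad_reverse(g_feats)  # reversed gradient sent to the encoders
```

The discriminator would update with `g_feats` as usual, while the feature encoders receive `g_to_encoder`, nudging image and text features toward a shared, modality-invariant space.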

Original language: English
Title of host publication: Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data
Editors: Leong Hou U, Marc Spaniol, Yasushi Sakurai, Junying Chen
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 316-331
Number of pages: 16
ISBN (Print): 9783030858957
DOIs
Publication status: Published - Aug 2021
Event: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021 - Guangzhou, China
Duration: 23 Aug 2021 - 25 Aug 2021

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12858 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Country/Territory: China
City: Guangzhou
Period: 23/08/21 - 25/08/21

Keywords

  • Domain adaptation
  • Modality-invariant co-learning
  • Visual question answering

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
