TY - GEN
T1 - Learning Modality-Invariant Features by Cross-Modality Adversarial Network for Visual Question Answering
AU - Fu, Ze
AU - Zheng, Changmeng
AU - Cai, Yi
AU - Li, Qing
AU - Wang, Tao
N1 - Funding Information:
Acknowledgments. This work was supported by the National Natural Science Foundation of China (No. 62076100), the National Key Research and Development Program of China (Standard knowledge graph for epidemic prevention and production recovering intelligent service platform and its applications), the Fundamental Research Funds for the Central Universities, SCUT (No. D2201300, D2210010), the Science and Technology Programs of Guangzhou (201902010046), and the Science and Technology Planning Project of Guangdong Province (No. 2020B0101100002).
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Visual Question Answering (VQA) is a typical multimodal task with significant development prospects for web applications. To answer a question about a corresponding image, a VQA model needs to use information from the different modalities efficiently. Although multimodal fusion methods such as attention mechanisms have contributed significantly to VQA, they co-learn the multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning to learn modality-invariant features for better semantic alignment and higher answer prediction accuracy. The model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
AB - Visual Question Answering (VQA) is a typical multimodal task with significant development prospects for web applications. To answer a question about a corresponding image, a VQA model needs to use information from the different modalities efficiently. Although multimodal fusion methods such as attention mechanisms have contributed significantly to VQA, they co-learn the multimodal features directly, ignoring the large gap between modalities and therefore aligning their semantics poorly. In this paper, we propose a Cross-Modality Adversarial Network (CMAN) to address this limitation. Our method combines cross-modality adversarial learning with modality-invariant attention learning to learn modality-invariant features for better semantic alignment and higher answer prediction accuracy. The model achieves an accuracy of 70.81% on the test-dev split of the VQA-v2 dataset. Our results also show that the model effectively narrows the gap between modalities and improves the alignment of multimodal information.
KW - Domain adaptation
KW - Modality-invariant co-learning
KW - Visual question answering
UR - http://www.scopus.com/inward/record.url?scp=85115199830&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-85896-4_25
DO - 10.1007/978-3-030-85896-4_25
M3 - Conference article published in proceeding or book
AN - SCOPUS:85115199830
SN - 9783030858957
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 316
EP - 331
BT - Web and Big Data - 5th International Joint Conference, APWeb-WAIM 2021, Proceedings
A2 - U, Leong Hou
A2 - Spaniol, Marc
A2 - Sakurai, Yasushi
A2 - Chen, Junying
PB - Springer Science and Business Media Deutschland GmbH
T2 - 5th International Joint Conference on Asia-Pacific Web and Web-Age Information Management, APWeb-WAIM 2021
Y2 - 23 August 2021 through 25 August 2021
ER -