Towards Bridged Vision and Language: Learning Cross-Modal Knowledge Representation for Relation Extraction

Junhao Feng, Guohua Wang, Changmeng Zheng, Yi Cai, Ze Fu, Yaowei Wang, Xiao Yong Wei, Qing Li

Research output: Journal article publication > Journal article > Academic research > peer-review

5 Citations (Scopus)

Abstract

In natural language processing, relation extraction (RE) aims to detect and classify the semantic relationship between two given entities within a sentence. Previous RE methods consider only textual content and suffer a performance decline on social media, where texts often lack context. Incorporating text-related visual information can supply the missing semantics for relation extraction in social media posts. However, textual relations are usually abstract and carry high-level semantics, which creates a semantic gap between visual content and textual expressions. In this paper, we propose RECK, a neural network for relation extraction with cross-modal knowledge representations. Unlike previous multimodal methods, which train a common subspace for all modalities, we bridge the semantic gap by explicitly selecting knowledge paths from external knowledge through cross-modal object-entity pairs. We further extend the paths into a knowledge graph and adopt a graph attention network to capture multi-grained relevant concepts, which provide higher-level, key semantic information from external knowledge. In addition, we employ a cross-modal attention mechanism to align and fuse the multimodal information. Experimental results on a multimodal RE dataset show that our model achieves new state-of-the-art performance with knowledge evidence.
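The cross-modal attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the RECK implementation: it assumes plain scaled dot-product attention in which each text token attends over visual object features, followed by concatenation as the fusion step. The function names, the shared feature dimension, and the toy vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text_feats, visual_feats):
    """Hypothetical sketch: text tokens (queries) attend over visual objects.

    text_feats:   list of token vectors, each of length dim
    visual_feats: list of object vectors, each of length dim (assumed same dim)
    Returns (fused, weights): fused[i] is the token vector concatenated with
    its attended visual vector; weights[i] is the attention distribution.
    """
    dim = len(text_feats[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in text_feats:
        # Scaled dot-product score between this token and every visual object.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in visual_feats
        ]
        attn = softmax(scores)
        # Attention-weighted sum of the visual object vectors.
        attended = [
            sum(w * obj[d] for w, obj in zip(attn, visual_feats))
            for d in range(dim)
        ]
        # Fusion by concatenation (an assumption, chosen for simplicity).
        fused.append(list(query) + attended)
        weights.append(attn)
    return fused, weights


# Toy inputs: two text tokens and three visual objects in a 2-d feature space.
text = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_modal_attention(text, visual)
```

Each row of `weights` sums to 1, and a token places more weight on visual objects whose features point in a similar direction. A trained model would add learned query, key, and value projections before this step.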

Original language: English
Pages (from-to): 561-575
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 1
Publication status: Published - 1 Jan 2024

Keywords

  • graph attention network
  • knowledge graphs
  • Multimodal relation extraction

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering
