TY - JOUR
T1 - Towards Bridged Vision and Language: Learning Cross-Modal Knowledge Representation for Relation Extraction
AU - Feng, Junhao
AU - Wang, Guohua
AU - Zheng, Changmeng
AU - Cai, Yi
AU - Fu, Ze
AU - Wang, Yaowei
AU - Wei, Xiao-Yong
AU - Li, Qing
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - In natural language processing, relation extraction (RE) aims to detect and classify the semantic relationship between two given entities within a sentence. Previous RE methods consider only textual content and suffer a performance decline on social media, where texts often lack context. Incorporating text-related visual information can supplement the missing semantics for relation extraction in social media posts. However, textual relations are usually abstract and of high-level semantics, which creates a semantic gap between visual content and textual expressions. In this paper, we propose RECK, a neural network for relation extraction with cross-modal knowledge representations. Unlike previous multimodal methods that train a common subspace for all modalities, we bridge the semantic gap by explicitly selecting knowledge paths from external knowledge through cross-modal object-entity pairs. We further extend the paths into a knowledge graph and adopt a graph attention network to capture multi-grained relevant concepts, which provide higher-level and key semantic information from external knowledge. In addition, we employ a cross-modal attention mechanism to align and fuse the multimodal information. Experimental results on a multimodal RE dataset show that our model achieves new state-of-the-art performance with knowledge evidence.
AB - In natural language processing, relation extraction (RE) aims to detect and classify the semantic relationship between two given entities within a sentence. Previous RE methods consider only textual content and suffer a performance decline on social media, where texts often lack context. Incorporating text-related visual information can supplement the missing semantics for relation extraction in social media posts. However, textual relations are usually abstract and of high-level semantics, which creates a semantic gap between visual content and textual expressions. In this paper, we propose RECK, a neural network for relation extraction with cross-modal knowledge representations. Unlike previous multimodal methods that train a common subspace for all modalities, we bridge the semantic gap by explicitly selecting knowledge paths from external knowledge through cross-modal object-entity pairs. We further extend the paths into a knowledge graph and adopt a graph attention network to capture multi-grained relevant concepts, which provide higher-level and key semantic information from external knowledge. In addition, we employ a cross-modal attention mechanism to align and fuse the multimodal information. Experimental results on a multimodal RE dataset show that our model achieves new state-of-the-art performance with knowledge evidence.
KW - graph attention network
KW - knowledge graphs
KW - multimodal relation extraction
UR - http://www.scopus.com/inward/record.url?scp=85162693333&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2023.3284474
DO - 10.1109/TCSVT.2023.3284474
M3 - Journal article
AN - SCOPUS:85162693333
SN - 1051-8215
VL - 34
SP - 561
EP - 575
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 1
ER -