TY - GEN
T1 - Scene Graph with 3D Information for Change Captioning
AU - Liao, Zeming
AU - Huang, Qingbao
AU - Liang, Yu
AU - Fu, Mingyi
AU - Cai, Yi
AU - Li, Qing
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/10/17
Y1 - 2021/10/17
N2 - Change captioning aims to describe the differences between image pairs in natural language. It is an interesting yet under-explored task with two main challenges: correctly describing the relative positions of objects and overcoming disturbances from viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of images (i.e., object depths, relative two-dimensional image-plane distances, and relative angles between objects) to construct scene graphs for image pairs, then aggregate the node representations with a graph convolutional network. Owing to the relative position relationships between objects encoded in the scene graphs, our model can assist observers in quickly locating changed objects and is immune to viewpoint changes to some extent. Extensive experiments show that our SGCC model achieves performance competitive with state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of the proposed model. Code is available at https://github.com/VISLANG-Lab/SGCC.
AB - Change captioning aims to describe the differences between image pairs in natural language. It is an interesting yet under-explored task with two main challenges: correctly describing the relative positions of objects and overcoming disturbances from viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of images (i.e., object depths, relative two-dimensional image-plane distances, and relative angles between objects) to construct scene graphs for image pairs, then aggregate the node representations with a graph convolutional network. Owing to the relative position relationships between objects encoded in the scene graphs, our model can assist observers in quickly locating changed objects and is immune to viewpoint changes to some extent. Extensive experiments show that our SGCC model achieves performance competitive with state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of the proposed model. Code is available at https://github.com/VISLANG-Lab/SGCC.
KW - change captioning
KW - image difference description
KW - scene graph
UR - http://www.scopus.com/inward/record.url?scp=85119321503&partnerID=8YFLogxK
U2 - 10.1145/3474085.3475712
DO - 10.1145/3474085.3475712
M3 - Conference article published in proceeding or book
AN - SCOPUS:85119321503
T3 - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
SP - 5074
EP - 5082
BT - MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia
PB - Association for Computing Machinery, Inc
T2 - 29th ACM International Conference on Multimedia, MM 2021
Y2 - 20 October 2021 through 24 October 2021
ER -