TY - GEN
T1 - Multi-scale Attentive Fusion Network for Remote Sensing Image Change Captioning
AU - Chen, Cai
AU - Wang, Yi
AU - Yap, Kim Hui
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024/7
Y1 - 2024/7
N2 - Remote-sensing Image Change Captioning (RSICC) aims to automatically generate sentences describing the content differences between bitemporal remote-sensing images. Most methods refine the model architecture to improve on previous work, overlooking the characteristics that set remote-sensing images apart from natural images, such as the need to recognize changes in objects of various scales (e.g., small- and large-scale objects). Considering this difference, we propose a Multi-scale Attentive Fusion Network (MAF-Net) to adaptively capture and describe object changes across a wide range of scales. MAF-Net first extracts multi-scale visual features of the bitemporal images from different stages of the CNN backbone, then captures the changes in each feature pair with the proposed Multi-scale Change-Aware Encoders (MCAE). Specifically, the MCAE captures change-aware discriminative information over the paired multi-scale bitemporal features through Transformer-based difference and content cross-attention encoding. Furthermore, a Gated Attentive Fusion (GAF) module is introduced to adaptively aggregate the relevant change-aware features and enhance change captioning performance. We evaluate the effectiveness of our proposed method on two RSICC datasets (i.e., LEVIR-CC and LEVIRCCD), and experimental results demonstrate that our method achieves state-of-the-art performance.
AB - Remote-sensing Image Change Captioning (RSICC) aims to automatically generate sentences describing the content differences between bitemporal remote-sensing images. Most methods refine the model architecture to improve on previous work, overlooking the characteristics that set remote-sensing images apart from natural images, such as the need to recognize changes in objects of various scales (e.g., small- and large-scale objects). Considering this difference, we propose a Multi-scale Attentive Fusion Network (MAF-Net) to adaptively capture and describe object changes across a wide range of scales. MAF-Net first extracts multi-scale visual features of the bitemporal images from different stages of the CNN backbone, then captures the changes in each feature pair with the proposed Multi-scale Change-Aware Encoders (MCAE). Specifically, the MCAE captures change-aware discriminative information over the paired multi-scale bitemporal features through Transformer-based difference and content cross-attention encoding. Furthermore, a Gated Attentive Fusion (GAF) module is introduced to adaptively aggregate the relevant change-aware features and enhance change captioning performance. We evaluate the effectiveness of our proposed method on two RSICC datasets (i.e., LEVIR-CC and LEVIRCCD), and experimental results demonstrate that our method achieves state-of-the-art performance.
KW - Image Change Captioning (ICC)
KW - Multi-scale Change Awareness
KW - Remote Sensing (RS)
UR - http://www.scopus.com/inward/record.url?scp=85198531426&partnerID=8YFLogxK
U2 - 10.1109/ISCAS58744.2024.10558583
DO - 10.1109/ISCAS58744.2024.10558583
M3 - Conference article published in proceeding or book
AN - SCOPUS:85198531426
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
BT - ISCAS 2024 - IEEE International Symposium on Circuits and Systems
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE International Symposium on Circuits and Systems, ISCAS 2024
Y2 - 19 May 2024 through 22 May 2024
ER -