TY - GEN
T1 - Multimodal Crowd Counting with Mutual Attention Transformers
AU - Wu, Zhengtao
AU - Liu, Lingbo
AU - Zhang, Yang
AU - Mao, Mingzhi
AU - Lin, Liang
AU - Li, Guanbin
N1 - Funding Information:
*Corresponding authors are Mingzhi Mao and Guanbin Li. This work was supported in part by the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2020B1515020048, in part by the National Natural Science Foundation of China under Grant No. 61976250 and No. U1811463, and in part by the Guangzhou Science and Technology Project under Grant No. 202102020633.
Publisher Copyright:
© 2022 IEEE.
PY - 2022/8
Y1 - 2022/8
N2 - Crowd counting is a fundamental yet challenging task that aims to automatically estimate the number of people in crowded scenes. Nowadays, with the rapid development of thermal and depth sensors, thermal images and depth maps have become more accessible and are proven to be beneficial information for boosting the performance of crowd counting. Consequently, we propose a Mutual Attention Transformer (MAT) module to fully leverage the complementary information of different modalities. Specifically, our MAT employs a cross-modal mutual attention mechanism to utilize the features of one modality to enhance the features of the other. Moreover, to improve performance by learning better visual representations and further exploiting modality-wise complementarity, we design a self-supervised pre-training method based on cross-modal image reconstruction. Extensive experiments on two standard benchmarks (i.e., RGBT-CC and ShanghaiTechRGBD) show that the proposed method is effective and universal for multimodal crowd counting, outperforming previous state-of-the-art methods.
AB - Crowd counting is a fundamental yet challenging task that aims to automatically estimate the number of people in crowded scenes. Nowadays, with the rapid development of thermal and depth sensors, thermal images and depth maps have become more accessible and are proven to be beneficial information for boosting the performance of crowd counting. Consequently, we propose a Mutual Attention Transformer (MAT) module to fully leverage the complementary information of different modalities. Specifically, our MAT employs a cross-modal mutual attention mechanism to utilize the features of one modality to enhance the features of the other. Moreover, to improve performance by learning better visual representations and further exploiting modality-wise complementarity, we design a self-supervised pre-training method based on cross-modal image reconstruction. Extensive experiments on two standard benchmarks (i.e., RGBT-CC and ShanghaiTechRGBD) show that the proposed method is effective and universal for multimodal crowd counting, outperforming previous state-of-the-art methods.
KW - Crowd Counting
KW - Multimodal
KW - Mutual Attention
KW - Self-Supervised Learning
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85137692634&partnerID=8YFLogxK
U2 - 10.1109/ICME52920.2022.9859777
DO - 10.1109/ICME52920.2022.9859777
M3 - Conference article published in proceeding or book
AN - SCOPUS:85137692634
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
BT - ICME 2022 - IEEE International Conference on Multimedia and Expo 2022, Proceedings
PB - IEEE Computer Society
T2 - 2022 IEEE International Conference on Multimedia and Expo, ICME 2022
Y2 - 18 July 2022 through 22 July 2022
ER -