Multimodal Crowd Counting with Mutual Attention Transformers

Zhengtao Wu, Lingbo Liu, Yang Zhang, Mingzhi Mao, Liang Lin, Guanbin Li

Research output: Chapter in book / Conference proceeding; Conference article published in proceeding or book; Academic research; peer-review

11 Citations (Scopus)


Crowd counting is a fundamental yet challenging task that aims to automatically estimate the number of people in crowded scenes. With the rapid development of thermal and depth sensors, thermal images and depth maps have become more accessible, and both have been shown to improve crowd counting performance. We therefore propose a Mutual Attention Transformer (MAT) module to fully leverage the complementary information of different modalities. Specifically, MAT employs a cross-modal mutual attention mechanism that uses the features of one modality to enhance the features of the other. Moreover, to learn better visual representations and further exploit modality-wise complementarity, we design a self-supervised pre-training method based on cross-modal image reconstruction. Extensive experiments on two standard benchmarks (RGBT-CC and ShanghaiTechRGBD) show that the proposed method is effective and universal for multimodal crowd counting, outperforming previous state-of-the-art methods.
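The mutual attention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' MAT module: it assumes plain scaled dot-product attention with no learned projections, no multi-head split, and a simple residual addition, and the names `cross_attention` and `mutual_attention` are hypothetical. It only shows how tokens from one modality (for example RGB) can be enhanced with information gathered from the other (for example thermal or depth), and vice versa.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query token (modality A)
    # takes a weighted average of the value tokens (modality B).
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mutual_attention(rgb_tokens, aux_tokens):
    # Assumption: each modality queries the other, and the attended
    # features are added back as a residual. A real module would also
    # use learned query/key/value projections.
    rgb_from_aux = cross_attention(rgb_tokens, aux_tokens, aux_tokens)
    aux_from_rgb = cross_attention(aux_tokens, rgb_tokens, rgb_tokens)
    rgb_out = [[a + b for a, b in zip(t, u)]
               for t, u in zip(rgb_tokens, rgb_from_aux)]
    aux_out = [[a + b for a, b in zip(t, u)]
               for t, u in zip(aux_tokens, aux_from_rgb)]
    return rgb_out, aux_out

# Toy input: 2 RGB tokens and 3 auxiliary (thermal or depth) tokens, dimension 2.
rgb = [[1.0, 0.0], [0.0, 1.0]]
aux = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
rgb_enhanced, aux_enhanced = mutual_attention(rgb, aux)
```

Each output keeps the token count and feature dimension of its own modality, so the two enhanced feature sets can be passed on to later layers independently or fused.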

Original language: English
Title of host publication: ICME 2022 - IEEE International Conference on Multimedia and Expo 2022, Proceedings
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665485630
Publication status: Published - Aug 2022
Event: 2022 IEEE International Conference on Multimedia and Expo, ICME 2022 - Taipei, Taiwan
Duration: 18 Jul 2022 to 22 Jul 2022

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X


Conference: 2022 IEEE International Conference on Multimedia and Expo, ICME 2022


Keywords

  • Crowd Counting
  • Multimodal
  • Mutual Attention
  • Self-Supervised Learning
  • Transformer

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

