A three-stream fusion and self-differential attention network for multi-modal crowd counting

Haihan Tang, Yi Wang, Zhiping Lin, Lap Pui Chau, Huiping Zhuang

Research output: Journal article › Academic research › peer-review


Multi-modal crowd counting aims to use multiple types of data, such as RGB-Thermal and RGB-Depth, to count the number of people in crowded scenes. Current methods mainly focus on two-stream fusion of multi-modal information in the encoder and single-scale semantic features in the decoder. In this paper, we propose an end-to-end three-stream fusion and self-differential attention network that simultaneously addresses the multi-modal fusion and scale variation problems in multi-modal crowd counting. Specifically, the encoder adopts three-stream fusion to fuse stage-wise modality-paired and modality-specific features. The decoder applies a self-differential attention mechanism to multi-level fused features to adaptively extract basic and differential information, and finally a counting head predicts the density map. Experimental results on RGB-T and RGB-D benchmarks show the superiority of the proposed method over state-of-the-art multi-modal crowd counting methods. Ablation studies and visualizations demonstrate the advantages of the proposed modules.
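The abstract's two ideas can be illustrated with a toy sketch. The paper does not give its exact formulation, so everything below is a hypothetical simplification: the "three streams" are taken to be the two modality-specific features plus an element-wise modality-paired combination, and "self-differential attention" is modeled as keeping a basic signal while adaptively gating the difference between two feature levels.

```python
import numpy as np

def sigmoid(x):
    # numerically plain logistic gate
    return 1.0 / (1.0 + np.exp(-x))

def three_stream_fuse(rgb_feat, aux_feat):
    """Toy three-stream fusion (assumption, not the paper's module):
    two modality-specific streams plus a modality-paired stream,
    here a simple element-wise sum, averaged together."""
    paired = rgb_feat + aux_feat                 # modality-paired stream
    return np.stack([rgb_feat, aux_feat, paired]).mean(axis=0)

def self_differential_attention(feat_low, feat_high):
    """Toy self-differential attention (assumption): keep the basic
    higher-level signal and adaptively re-weight the differential
    information between feature levels."""
    diff = feat_high - feat_low                  # differential information
    gate = sigmoid(diff)                         # adaptive weighting
    return feat_high + gate * diff               # basic + gated differential

rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 4))                # stand-in RGB feature map
thermal = rng.standard_normal((4, 4))            # stand-in thermal feature map
fused = three_stream_fuse(rgb, thermal)
out = self_differential_attention(0.5 * fused, fused)
print(out.shape)                                 # same spatial size as inputs
```

In the actual network these operations would act on multi-channel convolutional feature maps at each encoder stage and decoder level, not on single 4x4 arrays; the sketch only shows the fuse-then-gate data flow the abstract describes.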

Original language: English
Pages (from-to): 35-41
Number of pages: 7
Journal: Pattern Recognition Letters
Publication status: Published - Jul 2024


Keywords
  • Crowd counting
  • Multi-modal data
  • Self-differential attention
  • Three-stream fusion

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence


