Weakly-Supervised Crowd Counting with Token Attention and Fusion: A Simple and Effective Baseline

Yi Wang, Qiongyang Hu, Lap Pui Chau

Research output: Journal article publication › Conference article › Academic research › peer-review

Abstract

Conventional crowd counting methods exploit a large number of point annotations to train regression-based neural networks for density map estimation. However, laborious point annotations of human heads (strong supervision) are required in training. This paper presents a simple and effective crowd counting method with only image-level count annotations, i.e., the number of people in an image (weak supervision). Specifically, we first investigate three backbone networks and find the significance of the global information extracted by self-attention for weakly-supervised crowd counting. Then, we propose an effective network composed of a Transformer backbone and token channel attention module (T-CAM) in the counting head, where the attention in channels of tokens can compensate for the self-attention between tokens of the Transformer. Finally, a simple token fusion is proposed to obtain global information. Experimental results on two representative crowd counting benchmarks show the superiority of the proposed method, with an average 10% relative improvement compared with baselines. The code is publicly available at https://github.com/WangyiNTU/WSCC_TAF.
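The pipeline described in the abstract (backbone tokens, channel attention over tokens, token fusion, count regression trained on image-level counts) can be illustrated with a minimal sketch. It is not the authors' implementation, which is available in the linked repository. The squeeze-and-excitation style gate, the averaging used for fusion, the L1 loss, and all function and parameter names (`channel_attention`, `fuse_tokens`, `predict_count`, `w1`, `w2`, `w_out`, `b_out`) are assumptions made for illustration, and the Transformer backbone is replaced by a hand-written list of token vectors.

```python
import math


def channel_attention(tokens, w1, w2):
    """Assumed squeeze-and-excitation style gate over token channels.

    tokens: list of N token vectors, each with C channels.
    w1: hidden x C weight matrix, w2: C x hidden weight matrix.
    """
    n, c = len(tokens), len(tokens[0])
    # Squeeze: average each channel over all tokens.
    squeezed = [sum(t[j] for t in tokens) / n for j in range(c)]
    # Excite: linear + ReLU, then linear + sigmoid, giving one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Rescale every token channel-wise by its gate.
    return [[t[j] * gates[j] for j in range(c)] for t in tokens]


def fuse_tokens(tokens):
    """Assumed token fusion: average all tokens into one global vector."""
    n, c = len(tokens), len(tokens[0])
    return [sum(t[j] for t in tokens) / n for j in range(c)]


def predict_count(tokens, w1, w2, w_out, b_out):
    """Counting head: channel attention, token fusion, linear regression."""
    fused = fuse_tokens(channel_attention(tokens, w1, w2))
    return sum(w * f for w, f in zip(w_out, fused)) + b_out


def count_l1_loss(predicted, ground_truth):
    """Weak supervision: only the image-level count is compared."""
    return abs(predicted - ground_truth)


if __name__ == "__main__":
    # Two tokens with two channels each, standing in for backbone output.
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    w1 = [[0.5, -0.5]]        # 1 hidden unit
    w2 = [[1.0], [-1.0]]      # back to 2 channels
    pred = predict_count(tokens, w1, w2, w_out=[2.0, 2.0], b_out=1.0)
    print(pred, count_l1_loss(pred, ground_truth=4.0))
```

The point of the sketch is that no density map or head location appears anywhere: the only training signal is the scalar difference between the predicted and annotated counts.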

Original language: English
Pages (from-to): 13456-13460
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Publication status: Published - Apr 2024
Event: 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of
Duration: 14 Apr 2024 to 19 Apr 2024

Keywords

  • Crowd counting
  • token attention
  • token fusion
  • vision transformer
  • weak supervision

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering
