Abstract
Conventional crowd counting methods exploit a large number of point annotations to train regression-based neural networks for density map estimation. However, laborious point annotations of human heads (strong supervision) are required for training. This paper presents a simple and effective crowd counting method that uses only image-level count annotations, i.e., the number of people in an image (weak supervision). Specifically, we first investigate three backbone networks and find that the global information extracted by self-attention is crucial for weakly-supervised crowd counting. Then, we propose an effective network composed of a Transformer backbone and a token channel attention module (T-CAM) in the counting head, where attention across the channels of tokens compensates for the Transformer's self-attention between tokens. Finally, a simple token fusion is proposed to obtain global information. Experimental results on two representative crowd counting benchmarks show the superiority of the proposed method, with an average 10% relative improvement over baselines. The code is publicly available at https://github.com/WangyiNTU/WSCC_TAF.
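The counting head described above can be illustrated with a minimal NumPy sketch: channel attention re-weights the channels of the Transformer's output tokens, the tokens are then fused into a single global feature, and a linear regressor maps that feature to a scalar count. This is a hypothetical illustration, not the paper's exact T-CAM; the squeeze-and-excitation-style gating, function names, and random stand-in weights are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_channel_attention(tokens, w1, w2):
    """Re-weight the C channels of (N, C) tokens.

    Illustrative gate (assumption, not the paper's exact module):
    average over tokens per channel, pass the (C,) descriptor through a
    small two-layer MLP, and scale every token's channels by the result.
    """
    squeezed = tokens.mean(axis=0)                        # (C,) channel descriptor
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))   # (C,) gate in (0, 1)
    return tokens * gate                                  # broadcast over N tokens

def count_head(tokens, w1, w2, w_reg):
    """Channel attention -> token fusion -> linear count regression."""
    attended = token_channel_attention(tokens, w1, w2)
    fused = attended.mean(axis=0)                         # simple token fusion: (C,)
    return float(fused @ w_reg)                           # scalar count estimate

# Toy dimensions: 196 tokens (14x14 patches), 64 channels, 16 hidden units.
N, C, H = 196, 64, 16
tokens = rng.standard_normal((N, C))
w1 = rng.standard_normal((H, C)) * 0.1
w2 = rng.standard_normal((C, H)) * 0.1
w_reg = rng.standard_normal(C) * 0.1
print(count_head(tokens, w1, w2, w_reg))
```

In training, the only supervision signal would be a loss between this scalar output and the image-level ground-truth count, which is what makes the setting weakly supervised.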
| Field | Value |
|---|---|
| Original language | English |
| Pages (from-to) | 13456-13460 |
| Number of pages | 5 |
| Journal | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
| DOIs | |
| Publication status | Published - Apr 2024 |
| Event | 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Seoul, Korea, Republic of. Duration: 14 Apr 2024 → 19 Apr 2024 |
Keywords
- Crowd counting
- token attention
- token fusion
- vision transformer
- weak supervision
ASJC Scopus subject areas
- Software
- Signal Processing
- Electrical and Electronic Engineering