CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Ruixiang Jiang, Lingbo Liu, Changwen Chen

Research output: Chapter in book / Conference proceedingConference article published in proceeding or bookAcademic researchpeer-review

5 Citations (Scopus)

Abstract

Recent advances in visual-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: https://github.com/songrise/CLIP-Count. https://github.com/songrise/CLIP-Count.

Original languageEnglish
Title of host publicationMM 2023 - Proceedings of the 31st ACM International Conference on Multimedia
PublisherAssociation for Computing Machinery, Inc
Pages4535-4545
Number of pages11
ISBN (Electronic)9798400701085
DOIs
Publication statusPublished - 26 Oct 2023
Event31st ACM International Conference on Multimedia, MM 2023 - Ottawa, Canada
Duration: 29 Oct 20233 Nov 2023

Publication series

NameMM 2023 - Proceedings of the 31st ACM International Conference on Multimedia

Conference

Conference31st ACM International Conference on Multimedia, MM 2023
Country/TerritoryCanada
CityOttawa
Period29/10/233/11/23

Keywords

  • class-agnostic object counting
  • clip
  • text-guided
  • zero-shot

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction
  • Software

Fingerprint

Dive into the research topics of 'CLIP-Count: Towards Text-Guided Zero-Shot Object Counting'. Together they form a unique fingerprint.

Cite this