DITA: DETR with improved queries for end-to-end temporal action detection

Chongkai Lu, Man Wai Mak

Research output: Journal article publication > Journal article > Academic research > peer-review

Abstract

The DEtection TRansformer (DETR), with its elegant architecture and set prediction, has revolutionized object detection. However, DETR-like models have yet to achieve comparable success in temporal action detection (TAD). To address this gap, we introduce a series of improvements to the original DETR, proposing a new DETR-based model for TAD that achieves competitive performance relative to conventional TAD methods. Specifically, we adapt advanced techniques from DETR variants used in object detection, including deformable attention, denoising training, and selective query recollection. Furthermore, we propose several new techniques aimed at enhancing detection precision and model convergence speed, such as geographic query grouping and learnable proposals. Leveraging these innovations, we introduce a new model called DETR with Improved queries for Temporal Action Detection (DITA). DITA not only adheres to DETR's elegant design philosophy but is also competitive with state-of-the-art action detection models. Remarkably, it is the first TAD model to achieve an mAP over 70% on THUMOS14, outperforming the previous best DETR variant by 13.5 percentage points.
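
For readers unfamiliar with how TAD results such as the mAP figure above are usually scored: a predicted action segment is typically counted as a match when its temporal intersection-over-union (tIoU) with a ground-truth segment exceeds a chosen threshold. The following is a minimal sketch of that overlap measure, not code from the paper. It assumes segments are given as (start, end) pairs in seconds, and the function name `temporal_iou` is purely illustrative.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments on the same time axis."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Length of the overlapping interval, clamped at zero for disjoint segments.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Union = sum of both lengths minus the part counted twice.
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: a prediction that overlaps half of the ground truth.
prediction = (2.0, 6.0)
ground_truth = (4.0, 8.0)
print(round(temporal_iou(prediction, ground_truth), 3))  # 0.333
```

With a threshold of 0.5, this hypothetical prediction would not count as a correct detection, since only 2 of the 6 seconds covered by either segment are shared.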

Original language: English
Article number: 127914
Pages (from-to): 1-12
Number of pages: 11
Journal: Neurocomputing
Volume: 596
Publication status: Published - 1 Sept 2024

Keywords

  • Action recognition
  • Intelligent video system
  • Temporal action detection

ASJC Scopus subject areas

  • Artificial Intelligence
  • Cognitive Neuroscience
  • Computer Science Applications
