Abstract
Video scene graph generation is an emerging research topic that aims to interpret a video as a temporally evolving graph structure by representing video objects as nodes and their relations as edges. Existing approaches predominantly follow a multi-step scheme comprising frame-level object detection, relation recognition, and temporal association. Although effective, these approaches neglect the mutual interactions between the independent steps, resulting in sub-optimal solutions. We present a novel end-to-end framework for video scene graph generation that naturally unifies object detection, object tracking, and relation recognition via a new Transformer structure, namely the Temporal Propagation Transformer (TPT). In particular, TPT extends an existing Transformer-based object detector (e.g., DETR) along the temporal dimension with a query propagation module, which additionally associates the detected instances by identity across frames. A temporal dynamics encoder is then leveraged to dynamically enrich the features of the detected instances for relation recognition by attending to their historic states in previous frames. Meanwhile, a relation propagation strategy is devised to enforce the temporal consistency of relation recognition results across adjacent frames. Extensive experiments on the VidHOI and Action Genome benchmarks demonstrate the superior performance of the proposed TPT over state-of-the-art methods.
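To make the query-propagation idea in the abstract more concrete, the following is a minimal PyTorch sketch, not the authors' implementation: decoder output embeddings from frame t are reused as the object queries for frame t+1, so each query slot keeps the same identity across frames, and a simple attention over the per-slot history stands in for the temporal dynamics encoder. All module names, dimensions, and the history-buffer design are illustrative assumptions.

```python
import torch
import torch.nn as nn


class QueryPropagationSketch(nn.Module):
    """Illustrative sketch of temporal query propagation over a video clip."""

    def __init__(self, d_model=256, num_queries=100, num_heads=8):
        super().__init__()
        # Learned queries, used only for the first frame of a clip.
        self.init_queries = nn.Embedding(num_queries, d_model)
        # DETR-style decoder layer: queries attend to per-frame visual features.
        self.decoder_layer = nn.TransformerDecoderLayer(
            d_model, num_heads, batch_first=True
        )
        # Temporal attention: current instance features attend to their
        # historic states from previous frames (stand-in for the paper's
        # temporal dynamics encoder; the exact design is an assumption).
        self.temporal_attn = nn.MultiheadAttention(
            d_model, num_heads, batch_first=True
        )

    def forward(self, frame_features):
        """frame_features: (T, B, N_tokens, d_model) encoder outputs per frame."""
        T, B, _, d = frame_features.shape
        queries = self.init_queries.weight.unsqueeze(0).expand(B, -1, -1)
        history, outputs = [], []
        for t in range(T):
            # Decode instances for frame t; query slot k retains the identity
            # it had in frame t-1 because its embedding is reused as the query.
            inst = self.decoder_layer(queries, frame_features[t])
            if history:
                past = torch.stack(history, dim=2)        # (B, Q, t, d)
                keys = past.flatten(0, 1)                 # (B*Q, t, d)
                cur = inst.flatten(0, 1).unsqueeze(1)     # (B*Q, 1, d)
                enriched, _ = self.temporal_attn(cur, keys, keys)
                inst = inst + enriched.squeeze(1).view(B, -1, d)
            history.append(inst)
            outputs.append(inst)
            # Propagation step: next frame's queries are this frame's embeddings.
            queries = inst
        return torch.stack(outputs)  # (T, B, Q, d) per-frame instance features
```

The returned per-frame instance features would then feed a relation-recognition head; how the paper couples this with the relation propagation strategy is not shown here.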
| Original language | English |
| --- | --- |
| Pages (from-to) | 1613-1625 |
| Number of pages | 13 |
| Journal | IEEE Transactions on Multimedia |
| Volume | 26 |
| DOIs | |
| Publication status | Published - Jun 2023 |
Keywords
- temporal propagation
- transformer
- Video scene graph generation
ASJC Scopus subject areas
- Signal Processing
- Media Technology
- Computer Science Applications
- Electrical and Electronic Engineering