End-to-End Video Scene Graph Generation With Temporal Propagation Transformer

Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, Chang Wen Chen

Research output: Journal article publicationJournal articleAcademic researchpeer-review

2 Citations (Scopus)

Abstract

Video scene graph generation has been an emerging research topic, which aims to interpret a video as a temporally-evolving graph structure by representing video objects as nodes and their relations as edges. Existing approaches predominantly follow a multi-step scheme, including frame-level object detection, relation recognition and temporal association. Although effective, these approaches neglect the mutual interactions between independent steps, resulting in a sub-optimal solution. We present a novel end-to-end framework for video scene graph generation, which naturally unifies object detection, object tracking, and relation recognition via a new Transformer structure, namely Temporal Propagation Transformer (TPT). Particularly, TPT extends the existing Transformer-based object detector (e.g., DETR) along the temporal dimension by involving a query propagation module, which can additionally associate the detected instances by identities across frames. A temporal dynamics encoder is then leveraged to dynamically enrich the features of the detected instances for relation recognition by attending to their historic states in previous frames. Meanwhile, the relation propagation strategy is devised to emphasize the temporal consistency of relation recognition results among adjacent frames. Extensive experiments conducted on VidHOI and Action Genome benchmarks demonstrate the superior performance of the proposed TPT over the state-of-the-art methods.

Original languageEnglish
Pages (from-to)1613-1625
Number of pages13
JournalIEEE Transactions on Multimedia
Volume26
DOIs
Publication statusPublished - Jun 2023

Keywords

  • temporal propagation
  • transformer
  • Video scene graph generation

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'End-to-End Video Scene Graph Generation With Temporal Propagation Transformer'. Together they form a unique fingerprint.

Cite this