Self-Supervised Video Representation Learning by Serial Restoration With Elastic Complexity

Ziyu Chen, Hanli Wang, Chang Wen Chen

Research output: Journal article publicationJournal articleAcademic researchpeer-review

2 Citations (Scopus)

Abstract

Self-supervised video representation learning leaves out heavy manual annotation by automatically excavating supervisory signals. Although contrastive learning based approaches exhibit superior performances, pretext task based approaches still deserve further study. This is because the pretext tasks exploit the nature of data and encourage feature extractors to learn spatiotemporal logic by discovering dependencies among video clips or cubes, without manual engineering on data augmentations or manual construction of contrastive pairs. To utilize chronological property more effectively and efficiently, this work proposes a novel pretext task, named serial restoration of shuffled clips (SRSC), disentangled by an elaborately designed task network composed of an order-aware encoder and a serial restoration decoder. In contrast to other order based pretext tasks that formulate clip order recognition as a one-step classification problem, the proposed SRSC task restores shuffled clips into the right order in multiple steps. Owing to the excellent elasticity of SRSC, a novel taxonomy of curriculum learning is further proposed to equip SRSC with different pre-training strategies. According to the factors that affect the complexity of solving the SRSC task, the proposed curriculum learning strategies can be categorized into task based, model based and data based. Extensive experiments are conducted on the subdivided strategies to explore their effectiveness and noteworthy laws. Compared with existing approaches, this work demonstrates that the proposed approach achieves state-of-the-art performances in pretext task based self-supervised video representation learning and a majority of the proposed strategies further boost the performance of downstream tasks. For the first time, the features pre-trained by the pretext tasks are applied to video captioning by feature-level early fusion, and enhance the input of existing approaches as a lightweight plugin.

Original languageEnglish
Pages (from-to)2235-2248
Number of pages14
JournalIEEE Transactions on Multimedia
Volume26
DOIs
Publication statusPublished - Jul 2023

Keywords

  • action recognition
  • curriculum learning
  • nearest neighbor retrieval
  • pretext task
  • Self-supervised learning
  • video captioning
  • video representation learning

ASJC Scopus subject areas

  • Signal Processing
  • Media Technology
  • Computer Science Applications
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'Self-Supervised Video Representation Learning by Serial Restoration With Elastic Complexity'. Together they form a unique fingerprint.

Cite this