Holistic-Guided Disentangled Learning With Cross-Video Semantics Mining for Concurrent First-Person and Third-Person Activity Recognition

Tianshan Liu, Rui Zhao, Wenqi Jia, Kin Man Lam, Jun Kong

Research output: Journal article (Academic research, peer-reviewed)

1 Citation (Scopus)


The popularity of wearable devices has increased demand for research on first-person activity recognition. However, most current first-person activity datasets are built on the assumption that only the human–object interaction (HOI) activities performed by the camera-wearer are captured in the field of view. Since humans live in complex scenarios, third-person activities performed by other people are also likely to appear alongside the first-person activities. Analyzing and recognizing these two types of activities, occurring simultaneously in a scene, is important for the camera-wearer to understand the surrounding environment. To facilitate research on concurrent first- and third-person activity recognition (CFT-AR), we first created a new activity dataset, namely PolyU concurrent first- and third-person (CFT) Daily, which exhibits distinct properties and challenges compared with previous activity datasets. Since temporal asynchronism and an appearance gap usually exist between the first- and third-person activities, it is crucial to learn robust representations from all the activity-related spatio-temporal positions. Thus, we explore both holistic scene-level and local instance-level (person-level) features to provide comprehensive and discriminative patterns for recognizing both first- and third-person activities. On the one hand, the holistic scene-level features are extracted by a 3-D convolutional neural network, which is trained to mine shared and sample-unique semantics between video pairs via two well-designed attention-based modules and a self-knowledge distillation (SKD) strategy. On the other hand, we further leverage the extracted holistic features to guide the learning of instance-level features in a disentangled fashion, which aims to discover both spatially conspicuous patterns and temporally varied, yet critical, cues. Experimental results on the PolyU CFT Daily dataset validate that our method achieves state-of-the-art performance.
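Self-knowledge distillation, as referenced in the abstract, generally means a network is regularized against softened predictions produced by itself (e.g., an auxiliary branch or an earlier snapshot), rather than by a separate teacher model. A minimal, generic sketch of such a loss is below; the temperature value, the KL-divergence form, and the function names are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def skd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions.

    In self-knowledge distillation the teacher logits come from the same
    network (e.g., another branch or snapshot), so no separate teacher
    model is trained. Temperature and scaling follow common distillation
    practice and are assumptions here.
    """
    p_t = softmax(teacher_logits, temperature)  # soft targets
    p_s = softmax(student_logits, temperature)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float(np.mean(kl)) * temperature ** 2  # standard T^2 gradient scaling

# Identical logits yield zero loss; mismatched logits yield a positive loss.
same = np.array([[2.0, 0.5, -1.0]])
print(skd_loss(same, same))  # → 0.0
```

The temperature softens both distributions so that the student also learns from the relative ordering of non-target classes, which is the usual motivation for distillation-style objectives.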

Original language: English
Article number: 9887964
Pages (from-to): 1-15
Number of pages: 15
Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication status: Accepted/In press - 2022


Keywords

  • Activity recognition
  • Aggregates
  • Concurrent first-and third-person activity recognition (CFT-AR)
  • cross-video semantics mining (CVSM)
  • Feature extraction
  • Hardware design languages
  • holistic-guided disentangled learning (HDL)
  • self-knowledge distillation (SKD)
  • Semantics
  • Task analysis
  • Training

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence


