Deep Cross-modal Representation Learning and Distillation for Illumination-invariant Pedestrian Detection

Tianshan Liu, Kin Man Lam, Rui Zhao, Guoping Qiu

Research output: Journal article publication > Journal article > Academic research > peer-review

77 Citations (Scopus)

Abstract

Integrating multispectral data has been demonstrated to be an effective solution for illumination-invariant pedestrian detection. In particular, RGB and thermal images can provide complementary information to handle light variations. However, most current multispectral detectors fuse the multimodal features by simple concatenation, without discovering their latent relationships. In this paper, we propose a cross-modal feature learning (CFL) module, based on a split-and-aggregation strategy, to explicitly explore both the shared and modality-specific representations between paired RGB and thermal images. We insert the proposed CFL module into multiple layers of a two-branch-based pedestrian detection network, to learn the cross-modal representations at diverse semantic levels. By introducing a segmentation-based auxiliary task, the multimodal network is trained end-to-end by jointly optimizing a multi-task loss. On the other hand, to alleviate the reliance of existing multispectral pedestrian detectors on thermal images, we propose a knowledge distillation framework to train a student detector, which receives only RGB images as input and distills the cross-modal representations under the guidance of a well-trained multimodal teacher detector. To facilitate the cross-modal knowledge distillation, we design different distillation loss functions for the feature, detection and segmentation levels. Experimental results on the public KAIST multispectral pedestrian benchmark validate that the proposed cross-modal representation learning and distillation method achieves robust performance.
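
The abstract names three distillation levels (feature, detection and segmentation) but does not give the form of the losses. As a rough illustration only, the sketch below combines one term per level into a single student objective. Every specific here is an assumption rather than the authors' formulation: the mean-squared error on features, the temperature-softened KL divergence on detection and segmentation logits, the equal weights, and all function and key names (`feature_mse`, `soft_kl`, `distillation_loss`, `"feat"`, `"det"`, `"seg"`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def feature_mse(student_feat, teacher_feat):
    """Assumed feature-level term: mean squared error between feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


def soft_kl(student_logits, teacher_logits, temperature=2.0):
    """Assumed output-level term: KL(teacher || student) on softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three assumed terms (weights are hypothetical)."""
    w_feat, w_det, w_seg = weights
    return (
        w_feat * feature_mse(student["feat"], teacher["feat"])
        + w_det * soft_kl(student["det"], teacher["det"])
        + w_seg * soft_kl(student["seg"], teacher["seg"])
    )


# Toy outputs: the teacher would see RGB + thermal, the student RGB only.
teacher_out = {"feat": [0.2, 0.8, 0.5], "det": [2.0, -1.0], "seg": [0.5, 1.5]}
student_out = {"feat": [0.1, 0.6, 0.9], "det": [1.0, 0.0], "seg": [0.2, 1.0]}
loss = distillation_loss(student_out, teacher_out)
```

In a setup like this the teacher's outputs would be treated as fixed targets, so only the RGB-only student is updated; the loss is zero when the student reproduces the teacher exactly and positive otherwise.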

Original language: English
Article number: 9357413
Pages (from-to): 315-329
Number of pages: 15
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 32
Issue number: 1
DOIs
Publication status: Published - 1 Jan 2022

Keywords

  • cross-modal representation
  • Detectors
  • Feature extraction
  • Illumination-invariant pedestrian detection
  • Image segmentation
  • knowledge distillation
  • Lighting
  • multispectral fusion
  • Semantics
  • Task analysis
  • Training

ASJC Scopus subject areas

  • Media Technology
  • Electrical and Electronic Engineering
