TY - JOUR
T1 - Deep Cross-modal Representation Learning and Distillation for Illumination-invariant Pedestrian Detection
AU - Liu, Tianshan
AU - Lam, Kin Man
AU - Zhao, Rui
AU - Qiu, Guoping
N1 - This work was supported in part by the Key-Area Research and Development Program of Guangdong Province 2020 under Project 76 and in part by the Education Department of Guangdong Province, China, under Project 2019KZDZX1028.
Publisher Copyright:
© 1991-2012 IEEE.
PY - 2022/1/1
Y1 - 2022/1/1
N2 - Integrating multispectral data has been demonstrated to be an effective solution for illumination-invariant pedestrian detection; in particular, RGB and thermal images can provide complementary information to handle light variations. However, most of the current multispectral detectors fuse the multimodal features by simple concatenation, without discovering their latent relationships. In this paper, we propose a cross-modal feature learning (CFL) module, based on a split-and-aggregation strategy, to explicitly explore both the shared and modality-specific representations between paired RGB and thermal images. We insert the proposed CFL module into multiple layers of a two-branch-based pedestrian detection network, to learn the cross-modal representations at diverse semantic levels. By introducing a segmentation-based auxiliary task, the multimodal network is trained end-to-end by jointly optimizing a multi-task loss. On the other hand, to alleviate the reliance of existing multispectral pedestrian detectors on thermal images, we propose a knowledge distillation framework to train a student detector, which only receives RGB images as input and distills the cross-modal representations guided by a well-trained multimodal teacher detector. In order to facilitate the cross-modal knowledge distillation, we design different distillation loss functions for the feature, detection, and segmentation levels. Experimental results on the public KAIST multispectral pedestrian benchmark validate that the proposed cross-modal representation learning and distillation method achieves robust performance.
AB - Integrating multispectral data has been demonstrated to be an effective solution for illumination-invariant pedestrian detection; in particular, RGB and thermal images can provide complementary information to handle light variations. However, most of the current multispectral detectors fuse the multimodal features by simple concatenation, without discovering their latent relationships. In this paper, we propose a cross-modal feature learning (CFL) module, based on a split-and-aggregation strategy, to explicitly explore both the shared and modality-specific representations between paired RGB and thermal images. We insert the proposed CFL module into multiple layers of a two-branch-based pedestrian detection network, to learn the cross-modal representations at diverse semantic levels. By introducing a segmentation-based auxiliary task, the multimodal network is trained end-to-end by jointly optimizing a multi-task loss. On the other hand, to alleviate the reliance of existing multispectral pedestrian detectors on thermal images, we propose a knowledge distillation framework to train a student detector, which only receives RGB images as input and distills the cross-modal representations guided by a well-trained multimodal teacher detector. In order to facilitate the cross-modal knowledge distillation, we design different distillation loss functions for the feature, detection, and segmentation levels. Experimental results on the public KAIST multispectral pedestrian benchmark validate that the proposed cross-modal representation learning and distillation method achieves robust performance.
KW - cross-modal representation
KW - Detectors
KW - Feature extraction
KW - Illumination-invariant pedestrian detection
KW - Image segmentation
KW - knowledge distillation
KW - Lighting
KW - multispectral fusion
KW - Semantics
KW - Task analysis
KW - Training
UR - http://www.scopus.com/inward/record.url?scp=85101738767&partnerID=8YFLogxK
U2 - 10.1109/TCSVT.2021.3060162
DO - 10.1109/TCSVT.2021.3060162
M3 - Journal article
AN - SCOPUS:85101738767
SN - 1051-8215
VL - 32
SP - 315
EP - 329
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
IS - 1
M1 - 9357413
ER -