Abstract
Existing Knowledge Distillation (KD) methods typically focus on transferring knowledge from a large-capacity teacher to a low-capacity student model, achieving substantial success in unimodal knowledge transfer. However, existing methods can hardly be extended to Cross-Modal Knowledge Distillation (CMKD), where knowledge is transferred from a teacher modality to a different student modality, and inference is performed only on the distilled student modality. We empirically reveal that the modality gap, i.e., modality imbalance and soft label misalignment, causes the ineffectiveness of traditional KD in CMKD. As a solution, we propose a novel Customized Crossmodal Knowledge Distillation (C^2KD) method. Specifically, to alleviate the modality gap, the pre-trained teacher performs bidirectional distillation with the student to provide customized knowledge. An On-the-Fly Selection Distillation (OFSD) strategy is applied to selectively filter out samples with misaligned soft labels, where we distill cross-modal knowledge from non-target classes to avoid the modality imbalance issue. To further provide receptive cross-modal knowledge, proxy student and teacher models, inheriting unimodal and cross-modal knowledge, are formulated to progressively transfer cross-modal knowledge through bidirectional distillation. Experimental results on audio-visual, image-text, and RGB-depth datasets demonstrate that our method effectively transfers knowledge across modalities, outperforming traditional KD by a large margin.
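To make the OFSD idea concrete, below is a minimal PyTorch-style sketch. It assumes that "misaligned soft labels" are detected by comparing the teacher's and student's top-1 predictions per sample, and that non-target-class knowledge is transferred by removing the ground-truth class before a temperature-scaled KL divergence; the function name `ofsd_loss` and this particular selection rule are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def ofsd_loss(student_logits, teacher_logits, targets, temperature=4.0):
    """Illustrative sketch of On-the-Fly Selection Distillation (OFSD).

    Keeps only samples whose teacher and student top-1 predictions agree
    (a stand-in for detecting aligned soft labels), then distills the
    teacher's distribution over the non-target classes into the student.
    """
    num_classes = student_logits.size(1)

    # On-the-fly selection: drop samples whose soft labels are misaligned
    # across the two modalities (approximated here by top-1 disagreement).
    aligned = student_logits.argmax(dim=1) == teacher_logits.argmax(dim=1)
    if not aligned.any():
        return student_logits.new_zeros(())

    s = student_logits[aligned] / temperature
    t = teacher_logits[aligned] / temperature
    y = targets[aligned]

    # Keep only non-target-class logits, so the transferred knowledge
    # captures inter-class relations rather than the dominant target class.
    keep = ~F.one_hot(y, num_classes).bool()
    s_nt = s[keep].view(-1, num_classes - 1)
    t_nt = t[keep].view(-1, num_classes - 1)

    # Temperature-scaled KL divergence between the non-target distributions.
    return F.kl_div(F.log_softmax(s_nt, dim=1),
                    F.softmax(t_nt, dim=1),
                    reduction='batchmean') * temperature ** 2

# Usage: student and teacher logits come from different modalities.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
loss = ofsd_loss(student_logits, teacher_logits, targets)
loss.backward()
```

In the full method this loss would be applied bidirectionally between the proxy student and proxy teacher; the sketch shows only the one-directional, per-batch selection and non-target-class transfer.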
Original language | English |
---|---|
Title of host publication | Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. |
Pages | 16006-16015 |
Publication status | Published - Jun 2024 |