Abstract
Knowledge distillation (KD) aims to transfer knowledge from a high-capacity teacher model to a lightweight student model, enabling the student to reach a level of performance that would be unattainable through conventional training. In conventional KD, the temperature of the loss function, which controls the smoothness of the class distributions, is fixed. We argue that distribution smoothness is critical to the transfer of knowledge and propose an adversarial adaptive temperature module that sets the temperature dynamically during training to enhance the student's performance. Following the concept of decoupled knowledge distillation (DKD), we separate the Kullback–Leibler (KL) divergence into a target-class term and a non-target-class term. Unlike DKD, however, we adversarially update the temperature coefficients of the target and non-target classes to maximize the distillation loss. We name our method Adversarially Adaptive Temperature for DKD (AAT-DKD). Our approach improves on existing KD methods across the three VoxCeleb1 test sets for two student models (x-vector and ECAPA-TDNN). Specifically, compared with traditional KD and DKD, our method reduces the EER by 17.78% and 11.90%, respectively, with the ECAPA-TDNN speaker embedding. Moreover, our method performs well on CN-Celeb and VoxSRC21, further highlighting its robustness and effectiveness across datasets.
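The abstract describes the loss only in words; below is a minimal PyTorch sketch of the idea, assuming a standard DKD-style decomposition: the KL divergence is split into a target-class (TCKD) term and a non-target-class (NCKD) term, each with its own learnable temperature, and the temperatures are updated by gradient ascent on the distillation loss while the student is updated by gradient descent. The function name `dkd_loss`, the `alpha`/`beta` weights, the softplus temperature parameterization, the learning rates, and the toy tensors standing in for student and teacher outputs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def dkd_loss(student_logits, teacher_logits, target, temp_tc, temp_nc,
             alpha=1.0, beta=1.0):
    """DKD-style loss with separate temperatures for the target-class (TCKD)
    and non-target-class (NCKD) terms. Names and defaults are illustrative."""
    mask = F.one_hot(target, num_classes=student_logits.size(1)).bool()

    # Target-class term: KL between binary (target vs. rest) distributions.
    def binary_probs(logits, temp):
        p = F.softmax(logits / temp, dim=1)
        p_t = (p * mask).sum(dim=1, keepdim=True)     # mass on the target class
        return torch.cat([p_t, 1.0 - p_t], dim=1)     # [p_target, p_non_target]

    b_s = binary_probs(student_logits, temp_tc)
    b_t = binary_probs(teacher_logits, temp_tc)
    tckd = F.kl_div(b_s.clamp_min(1e-8).log(), b_t,
                    reduction="batchmean") * temp_tc ** 2

    # Non-target-class term: KL over the remaining classes only
    # (the target logit is masked out before the softmax).
    log_p_s = F.log_softmax(student_logits.masked_fill(mask, -1e9) / temp_nc, dim=1)
    p_t_nc = F.softmax(teacher_logits.masked_fill(mask, -1e9) / temp_nc, dim=1)
    nckd = F.kl_div(log_p_s, p_t_nc, reduction="batchmean") * temp_nc ** 2

    return alpha * tckd + beta * nckd


# Toy adversarial update: the student minimizes the loss while the two
# temperature parameters are pushed in the direction that maximizes it.
torch.manual_seed(0)
num_classes = 10
student_logits = torch.randn(8, num_classes, requires_grad=True)  # stand-in for student outputs
teacher_logits = torch.randn(8, num_classes)                      # stand-in for teacher outputs
labels = torch.randint(0, num_classes, (8,))

raw_tc = torch.nn.Parameter(torch.tensor(0.5))   # unconstrained temperature parameters
raw_nc = torch.nn.Parameter(torch.tensor(0.5))
student_opt = torch.optim.SGD([student_logits], lr=0.1)
temp_opt = torch.optim.SGD([raw_tc, raw_nc], lr=0.01)

for step in range(3):
    temp_tc = F.softplus(raw_tc) + 1.0   # keep temperatures above 1 (smoothing)
    temp_nc = F.softplus(raw_nc) + 1.0
    loss = dkd_loss(student_logits, teacher_logits, labels, temp_tc, temp_nc)

    student_opt.zero_grad()
    temp_opt.zero_grad()
    loss.backward()
    student_opt.step()                   # descent: the student reduces the loss
    raw_tc.grad.neg_()
    raw_nc.grad.neg_()
    temp_opt.step()                      # ascent: the temperatures increase the loss
    print(f"step {step}: loss={loss.item():.4f}, "
          f"T_tc={temp_tc.item():.3f}, T_nc={temp_nc.item():.3f}")
```

In a real training loop the stand-in logit tensors would be replaced by the outputs of the student and (frozen) teacher networks; the negated-gradient step is one simple way to realize the adversarial (maximize-then-minimize) temperature update the abstract describes.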
Original language | English
---|---
Article number | 129481
Journal | Neurocomputing
Volume | 624
DOIs |
Publication status | Published - 1 Apr 2025
Keywords
- Adaptive temperature
- Adversarial learning
- Knowledge distillation
- Speaker verification
ASJC Scopus subject areas
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence