TY - GEN
T1 - Interpretation-Empowered Neural Cleanse for Backdoor Attacks
AU - Ning, Liang Bo
AU - Dai, Zeyu
AU - Su, Jingran
AU - Pan, Chao
AU - Wang, Luning
AU - Fan, Wenqi
AU - Li, Qing
N1 - Publisher Copyright: © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/5/13
Y1 - 2024/5/13
N2 - Backdoor attacks pose a significant threat to deep neural networks, highlighting the need for robust defense strategies. Previous research has demonstrated that attribution maps change substantially when exposed to attacks, suggesting the potential of interpreters in detecting adversarial examples. However, most existing defense methods against backdoor attacks overlook the capabilities of interpreters and fail to leverage their potential. In this paper, we propose a novel approach called interpretation-empowered neural cleanse (IENC) for defending against backdoor attacks. Specifically, integrated gradients (IG) is adopted to bridge the interpreters and classifiers in order to reverse-engineer and reconstruct a high-quality backdoor trigger. Then, an interpretation-empowered adaptative pruning strategy (IEAPS) is proposed to cleanse the backdoor-related neurons without a pre-defined threshold. Additionally, a hybrid model patching approach is employed to integrate IEAPS with preprocessing techniques to enhance the defense performance. Comprehensive experiments are conducted on various datasets, demonstrating the potential of interpretations in defending against backdoor attacks and the superiority of the proposed method.
AB - Backdoor attacks pose a significant threat to deep neural networks, highlighting the need for robust defense strategies. Previous research has demonstrated that attribution maps change substantially when exposed to attacks, suggesting the potential of interpreters in detecting adversarial examples. However, most existing defense methods against backdoor attacks overlook the capabilities of interpreters and fail to leverage their potential. In this paper, we propose a novel approach called interpretation-empowered neural cleanse (IENC) for defending against backdoor attacks. Specifically, integrated gradients (IG) is adopted to bridge the interpreters and classifiers in order to reverse-engineer and reconstruct a high-quality backdoor trigger. Then, an interpretation-empowered adaptative pruning strategy (IEAPS) is proposed to cleanse the backdoor-related neurons without a pre-defined threshold. Additionally, a hybrid model patching approach is employed to integrate IEAPS with preprocessing techniques to enhance the defense performance. Comprehensive experiments are conducted on various datasets, demonstrating the potential of interpretations in defending against backdoor attacks and the superiority of the proposed method.
KW - Backdoor Attacks
KW - Defense Mechanism
KW - Interpretability
UR - http://www.scopus.com/inward/record.url?scp=85194458080&partnerID=8YFLogxK
U2 - 10.1145/3589335.3651525
DO - 10.1145/3589335.3651525
M3 - Conference article published in proceeding or book
AN - SCOPUS:85194458080
T3 - WWW 2024 Companion - Companion Proceedings of the ACM Web Conference
SP - 951
EP - 954
BT - WWW 2024 Companion - Companion Proceedings of the ACM Web Conference
PB - Association for Computing Machinery, Inc
T2 - 33rd ACM Web Conference, WWW 2024
Y2 - 13 May 2024 through 17 May 2024
ER -