TY - JOUR
T1 - Conservative Novelty Synthesizing Network for Malware Recognition in an Open-Set Scenario
AU - Guo, Jingcai
AU - Guo, Song
AU - Ma, Shiheng
AU - Sun, Yuxia
AU - Xu, Yuanyuan
N1 - Funding Information:
This work was supported in part by the Hong Kong RGC Research Impact Fund under Project R5060-19 and Project R5034-18, in part by the General Research Fund under Project 152221/19E and Project 15220320/20E, in part by the Collaborative Research Fund under Project C5026-18G, in part by the National Natural Science Foundation of China under Grant 61872310 and Grant 62102327, in part by the Shenzhen Science and Technology Innovation Commission under Grant R2020A045, and in part by the Guangdong Basic and Applied Basic Research Foundation under Grant 2021A1515012297.
Publisher Copyright:
© 2020 IEEE.
PY - 2021/8
Y1 - 2021/8
AB - We study the challenging task of malware recognition on both known and novel unknown malware families, called malware open-set recognition (MOSR). Previous works usually assume that the malware families are known to the classifier in a closed-set scenario, i.e., the testing families are a subset of, or at most identical to, the training families. However, novel unknown malware families frequently emerge in real-world applications and, as such, require recognizing malware instances in an open-set scenario, i.e., some unknown families are also included in the test set, a setting that has rarely been thoroughly investigated in the cyber-security domain. One practical solution for MOSR is to jointly classify known families and detect unknown ones with a single classifier (e.g., a neural network), using the variance of the predicted probability distribution over known families. However, conventional well-trained classifiers tend to output overly high recognition probabilities, especially when the instance feature distributions are similar to each other, e.g., unknown versus known malware families, and thus dramatically degrade the recognition of novel unknown malware families. To address this problem and construct an applicable MOSR system, we propose a novel model that conservatively synthesizes malware instances to mimic unknown malware families and support more robust training of the classifier. More specifically, we build upon generative adversarial networks to explore and obtain marginal malware instances that are close to known families while falling into mimicked unknown ones. These instances guide the classifier to lower and flatten the recognition probabilities of unknown families and relatively raise those of known ones, rectifying the performance of classification and detection. A cooperative training scheme involving classification, synthesis, and rectification is further constructed to facilitate training and jointly improve model performance. Moreover, we build a new large-scale malware dataset, named MAL-100, to fill the gap left by the lack of a large open-set malware benchmark dataset. Experimental results on two widely used malware datasets and our MAL-100 demonstrate the effectiveness of our model compared with other representative methods.
KW - Classification
KW - cyber security
KW - generative model
KW - malware recognition
KW - neural networks
UR - http://www.scopus.com/inward/record.url?scp=85112667077&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2021.3099122
DO - 10.1109/TNNLS.2021.3099122
M3 - Journal article
AN - SCOPUS:85112667077
SN - 2162-237X
VL - 34
SP - 662
EP - 676
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 2
ER -