TY - JOUR
T1 - A comprehensive comparative study of clustering-based unsupervised defect prediction models
AU - Xu, Zhou
AU - Li, Li
AU - Yan, Meng
AU - Liu, Jin
AU - Luo, Xiapu
AU - Grundy, John
AU - Zhang, Yifeng
AU - Zhang, Xiaohong
N1 - Funding Information:
The authors would like to thank Jaechang Nam for providing the source code of CLA and CLAMI. This work is supported by the National Key Research and Development Project (No. 2018YFB2101200), the National Natural Science Foundation of China (Nos. 62002034, 61972290), the Fundamental Research Funds for the Central Universities (Nos. 2020CDJQY-A021, 2020CDCGRJ072), China Postdoctoral Science Foundation (No. 2020M673137), the Natural Science Foundation of Chongqing in China (No. cstc2020jcyj-bshX0114), the Hong Kong Research Grant Council Project (No. 152239/18E). Grundy is supported by Australian Research Council Laureate Fellowship FL190100035.
Publisher Copyright:
© 2020 Elsevier Inc.
PY - 2021/2
Y1 - 2021/2
N2 - Software defect prediction recommends the most defect-prone software modules in order to optimize test resource allocation. The limitation of the extensively studied supervised defect prediction methods is that they require labeled software modules, which are not always available. An alternative solution is to apply clustering-based unsupervised models to the unlabeled defect data, an approach called Clustering-based Unsupervised Defect Prediction (CUDP). However, few studies have explored the impact of clustering-based models on defect prediction performance. In this work, we performed a large-scale empirical study on 40 unsupervised models to fill this gap. We chose an open-source dataset including 27 project versions with 3 types of features. The experimental results show that (1) different clustering-based models have significant performance differences, and the performance of models in the instance-violation-score-based clustering family is clearly superior to that of models in the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering families; (2) the models in the instance-violation-score-based clustering family achieve competitive performance compared with typical supervised models; (3) the impact of feature types on the performance of the models depends on the indicators used; and (4) the clustering-based unsupervised models do not always achieve better performance on defect data with the combination of the 3 types of features.
AB - Software defect prediction recommends the most defect-prone software modules in order to optimize test resource allocation. The limitation of the extensively studied supervised defect prediction methods is that they require labeled software modules, which are not always available. An alternative solution is to apply clustering-based unsupervised models to the unlabeled defect data, an approach called Clustering-based Unsupervised Defect Prediction (CUDP). However, few studies have explored the impact of clustering-based models on defect prediction performance. In this work, we performed a large-scale empirical study on 40 unsupervised models to fill this gap. We chose an open-source dataset including 27 project versions with 3 types of features. The experimental results show that (1) different clustering-based models have significant performance differences, and the performance of models in the instance-violation-score-based clustering family is clearly superior to that of models in the hierarchy-based, density-based, grid-based, sequence-based, and hybrid-based clustering families; (2) the models in the instance-violation-score-based clustering family achieve competitive performance compared with typical supervised models; (3) the impact of feature types on the performance of the models depends on the indicators used; and (4) the clustering-based unsupervised models do not always achieve better performance on defect data with the combination of the 3 types of features.
KW - Clustering-based unsupervised models
KW - Data analytics for defect prediction
KW - Empirical study
UR - http://www.scopus.com/inward/record.url?scp=85096687812&partnerID=8YFLogxK
U2 - 10.1016/j.jss.2020.110862
DO - 10.1016/j.jss.2020.110862
M3 - Journal article
AN - SCOPUS:85096687812
SN - 0164-1212
VL - 172
SP - 1
EP - 22
JO - Journal of Systems and Software
JF - Journal of Systems and Software
M1 - 110862
ER -