TY - GEN
T1 - Cross-version defect prediction via hybrid active learning with kernel principal component analysis
AU - Xu, Zhou
AU - Liu, Jin
AU - Luo, Xiapu
AU - Zhang, Tao
PY - 2018/4/2
Y1 - 2018/4/2
N2 - As defects in software modules may cause product failure and financial loss, it is critical to utilize defect prediction methods to effectively identify the potentially defective modules for a thorough inspection, especially in the early stage of software development lifecycle. For an upcoming version of a software project, it is practical to employ the historical labeled defect data of the prior versions within the same project to conduct defect prediction on the current version, i.e., Cross-Version Defect Prediction (CVDP). However, software development is a dynamic evolution process that may cause the data distribution (such as defect characteristics) to vary across versions. Furthermore, the raw features usually may not well reveal the intrinsic structure information behind the data. Therefore, it is challenging to perform effective CVDP. In this paper, we propose a two-phase CVDP framework that combines Hybrid Active Learning and Kernel PCA (HALKP) to address these two issues. In the first stage, HALKP uses a hybrid active learning method to select some informative and representative unlabeled modules from the current version for querying their labels, then merges them into the labeled modules of the prior version to form an enhanced training set. In the second stage, HALKP employs a non-linear mapping method, kernel PCA, to extract representative features by embedding the original data of two versions into a high-dimension space. We evaluate the HALKP framework on 31 versions of 10 projects with three prevalent performance indicators. The experimental results indicate that HALKP achieves encouraging results with average F-measure, g-mean and Balance of 0.480, 0.592 and 0.580, respectively and significantly outperforms nearly all baseline methods.
AB - As defects in software modules may cause product failure and financial loss, it is critical to utilize defect prediction methods to effectively identify the potentially defective modules for a thorough inspection, especially in the early stage of software development lifecycle. For an upcoming version of a software project, it is practical to employ the historical labeled defect data of the prior versions within the same project to conduct defect prediction on the current version, i.e., Cross-Version Defect Prediction (CVDP). However, software development is a dynamic evolution process that may cause the data distribution (such as defect characteristics) to vary across versions. Furthermore, the raw features usually may not well reveal the intrinsic structure information behind the data. Therefore, it is challenging to perform effective CVDP. In this paper, we propose a two-phase CVDP framework that combines Hybrid Active Learning and Kernel PCA (HALKP) to address these two issues. In the first stage, HALKP uses a hybrid active learning method to select some informative and representative unlabeled modules from the current version for querying their labels, then merges them into the labeled modules of the prior version to form an enhanced training set. In the second stage, HALKP employs a non-linear mapping method, kernel PCA, to extract representative features by embedding the original data of two versions into a high-dimension space. We evaluate the HALKP framework on 31 versions of 10 projects with three prevalent performance indicators. The experimental results indicate that HALKP achieves encouraging results with average F-measure, g-mean and Balance of 0.480, 0.592 and 0.580, respectively and significantly outperforms nearly all baseline methods.
UR - http://www.scopus.com/inward/record.url?scp=85051016836&partnerID=8YFLogxK
U2 - 10.1109/SANER.2018.8330210
DO - 10.1109/SANER.2018.8330210
M3 - Conference article published in proceeding or book
AN - SCOPUS:85051016836
T3 - 25th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2018 - Proceedings
SP - 209
EP - 220
BT - 25th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2018 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2018
Y2 - 20 March 2018 through 23 March 2018
ER -