TY - GEN
T1 - Cross version defect prediction with representative data via sparse subset selection
AU - Xu, Zhou
AU - Li, Shuai
AU - Tang, Yutian
AU - Luo, Xiapu
AU - Zhang, Tao
AU - Liu, Jin
AU - Xu, Jun
PY - 2018/5/28
Y1 - 2018/5/28
N2 - Software defect prediction aims to detect defect-prone software modules by mining historical development data from software repositories. Identifying such modules at an early stage of development can save large amounts of resources. Cross Version Defect Prediction (CVDP) is a practical scenario in which a classification model is trained on the historical data of the prior version and then used to predict the defect labels of modules in the current version. However, software development is a constantly evolving process, which leads to differences in data distribution across versions of the same project. These distribution differences degrade the performance of the classification model. In this paper, we address this issue by leveraging a state-of-the-art Dissimilarity-based Sparse Subset Selection (DS3) method. This method selects a representative module subset from the prior version based on the pairwise dissimilarities between the modules of the two versions and assigns each module of the current version to one of the representative modules. The selected modules represent the modules of the current version well, thus mitigating the distribution differences. We evaluate the effectiveness of DS3 for CVDP on a total of 40 cross-version pairs from 56 versions of 15 projects, using three traditional and two effort-aware indicators. Extensive experiments show that DS3 outperforms three baseline methods, especially in terms of the two effort-aware indicators.
KW - cross version defect prediction
KW - pairwise dissimilarities
KW - representative data
KW - sparse subset selection
UR - http://www.scopus.com/inward/record.url?scp=85051631927&partnerID=8YFLogxK
U2 - 10.1145/3196321.3196331
DO - 10.1145/3196321.3196331
M3 - Conference article published in proceeding or book
AN - SCOPUS:85051631927
SN - 9781450357142
T3 - Proceedings - International Conference on Software Engineering
SP - 132
EP - 143
BT - Proceedings - 2018 ACM/IEEE 26th International Conference on Program Comprehension, ICPC 2018
PB - IEEE Computer Society
T2 - ACM/IEEE 26th International Conference on Program Comprehension, ICPC 2018, collocated with the 40th International Conference on Software Engineering, ICSE 2018
Y2 - 27 May 2018 through 28 May 2018
ER -