A GitHub-Based Data Collection Method for Software Defect Prediction
Authors: Jiaxi Xu, Liang Yan, Fei Wang, J. Ai
Published in: 2019 6th International Conference on Dependable Systems and Their Applications (DSA)
Publication date: 2020-01-01
DOI: 10.1109/DSA.2019.00020 (https://doi.org/10.1109/DSA.2019.00020)
Citations: 5
Abstract
As software systems grow in scale and complexity, the number of software defects grows with them. Software defect data is the foundation of software reliability research and its applications. At present, the scarcity of such data, its insufficient coverage, and the narrow range of software types it represents have become a bottleneck for both. Taking GitHub, the open-source software hosting platform, as a starting point, this paper analyzes software defect data in open-source projects and classifies the software data available there. Based on a study of GitHub and Git repositories, we propose a technique for acquiring defect data from open-source software that uses pull requests as its entry point. We also propose a preprocessing procedure for software defect data and build a system for collecting large-scale software defect datasets that contain fix-inducing changes and the contextual information of defects, which addresses the class imbalance problem. Following this method, we developed a GitHub-based platform that collects software defect data automatically. Finally, experiments verified the efficiency of data collection, the correctness of the data, and the validity of the dataset in application. The results show that the proposed method is efficient and effective.
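The abstract does not say how pull requests are tied to defects, so the following is only a minimal sketch of one plausible first step, not the authors' pipeline. It assumes pull request records shaped like GitHub REST API objects (fields `number`, `title`, `body`, `merged_at`) and uses GitHub's closing keywords (`fixes #N`, `closes #N`, `resolves #N`) as a heuristic for linking a merged pull request to an issue. The function names are hypothetical.

```python
import re

# GitHub's documented closing keywords that link a pull request to an issue.
CLOSING_KEYWORDS = r"(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)"
LINK_PATTERN = re.compile(rf"\b{CLOSING_KEYWORDS}\b:?\s+#(\d+)", re.IGNORECASE)


def linked_issue_numbers(pull_request):
    """Return the sorted issue numbers a pull request claims to close.

    Assumption: `title` and `body` are the free-text fields to scan;
    either may be missing or None.
    """
    title = pull_request.get("title") or ""
    body = pull_request.get("body") or ""
    matches = LINK_PATTERN.findall(f"{title}\n{body}")
    return sorted({int(number) for number in matches})


def candidate_defect_fixes(pull_requests):
    """Keep merged pull requests that reference at least one issue.

    Assumption: a merged pull request that closes an issue is a candidate
    defect fix. Whether the issue really is a defect must be checked
    separately (for example, from issue labels).
    """
    candidates = []
    for pull_request in pull_requests:
        if not pull_request.get("merged_at"):
            continue  # unmerged pull requests never changed the code base
        issues = linked_issue_numbers(pull_request)
        if issues:
            candidates.append({"number": pull_request["number"], "issues": issues})
    return candidates
```

Keeping the filter as a pure function over already-fetched records makes it testable offline; fetching the records from GitHub and labeling the linked issues as defects are left out of this sketch.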