PyTraceBugs: A Large Python Code Dataset for Supervised Machine Learning in Software Defect Prediction

2021 28th Asia-Pacific Software Engineering Conference (APSEC). Pub Date: 2021-12-01. DOI: 10.1109/APSEC53868.2021.00022

E. Akimova, A. Bersenev, Artem A. Deikov, Konstantin S. Kobylkin, A. Konygin, I. Mezentsev, V. Misilov


Citations: 4

Abstract

Contemporary software engineering tools employ deep learning methods to identify bugs and defects in source code. Being data-hungry, supervised deep neural network models require large labeled datasets for robust and accurate training. In contrast to, say, Java, there is a lack of such datasets for Python. Most known datasets of labeled Python source code are relatively small. Those datasets are suitable for testing trained deep learning models, but not for training them. Therefore, larger labeled datasets have to be created, using widely accepted algorithmic principles to select relevant source code from the available public codebases. In this work, a large dataset of labeled Python source code, named PyTraceBugs, is created. It is intended for training, validating, and evaluating large deep learning models that identify a special class of low-level bugs in source code snippets: bugs that manifest as raised exceptions reported in standard traceback messages. Here, a code snippet is assumed to be either a function or a method implementation. The dataset contains 5.7 million correct source code snippets and 24 thousand buggy snippets from public GitHub repositories. The most represented bugs are absence of attribute, empty object, index out of range, and text encoding/decoding errors. The dataset is split into training, validation, and test samples. According to our estimates, confidence in the labeling of snippets as buggy or correct is about 85%. The labeling of snippets in the test sample is additionally validated manually, bringing confidence to almost 100%. To demonstrate the advantages of our dataset, it is used to train a binary classification model that distinguishes buggy from correct source code. This model employs pretrained BERT-like contextual embeddings. Its performance is as follows: on the test set, precision is 96% for buggy source code and 61% for correct source code, whereas recall is 34% and 99%, respectively. The model's performance is also estimated on the known BugsInPy dataset: here, it reports approximately 14% of the buggy snippets.
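To make the targeted bug class concrete, the following minimal Python sketch shows one function-level snippet for each of the four most represented bug categories, and the exception line a standard traceback would report for it. The snippets, the function names, and the mapping from each category to a specific built-in exception are illustrative assumptions. They are not taken from PyTraceBugs and do not reproduce the authors' labeling procedure.

```python
import traceback

# Hypothetical buggy snippets (NOT from PyTraceBugs). Each is a single
# function, matching the abstract's notion of a "code snippet".

def missing_attribute(config):
    # Absence of attribute: the object has no 'timeout' attribute.
    return config.timeout

def empty_object(record):
    # Empty object: 'record' is None, so subscripting it fails.
    return record["id"]

def index_out_of_range(items):
    # Index out of range: off-by-one access past the last element.
    return items[len(items)]

def decoding_error(raw):
    # Text decoding error: non-ASCII bytes decoded as ASCII.
    return raw.decode("ascii")

CASES = [
    (missing_attribute, object()),
    (empty_object, None),
    (index_out_of_range, [1, 2, 3]),
    (decoding_error, "é".encode("utf-8")),
]

def run_and_report(func, arg):
    """Call func(arg); return (exception class name, final traceback line)."""
    try:
        func(arg)
    except Exception as exc:
        last_line = traceback.format_exception_only(type(exc), exc)[-1]
        return type(exc).__name__, last_line.strip()
    return None, ""

RESULTS = {func.__name__: run_and_report(func, arg) for func, arg in CASES}

for name, (exc_name, message) in RESULTS.items():
    print(f"{name}: {message}")
```

Each call fails at run time rather than at parse time, which is why such defects surface only through traceback messages and not through syntax checks.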
