Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets

IF 6.5 | Zone 1 (Computer Science) | Q1, Computer Science, Software Engineering | IEEE Transactions on Software Engineering | Pub Date: 2024-07-05 | DOI: 10.1109/TSE.2024.3423712

Partha Chakraborty;Krishna Kanth Arumugam;Mahmoud Alfadel;Meiyappan Nagappan;Shane McIntosh

IEEE Transactions on Software Engineering, vol. 50, no. 8, pp. 2163-2177. https://ieeexplore.ieee.org/document/10587162/

Citations: 0

Abstract

Software vulnerabilities in everyday software systems are a serious concern. Although deep learning-based models have been proposed for vulnerability detection, their reliability remains in question. While prior evaluations of such models report impressive recall/F1 scores of up to 99%, we find that these models underperform in practical scenarios, particularly when evaluated on entire codebases rather than only on the fixing commit. In this paper, we introduce a comprehensive dataset (Real-Vul) designed to accurately represent real-world scenarios for evaluating vulnerability detection models. We evaluate the DeepWukong, LineVul, ReVeal, and IVDetect vulnerability detection approaches and observe a significant drop in performance, with precision declining by up to 95 percentage points and F1 scores dropping by up to 91 percentage points. A closer inspection reveals a substantial overlap in the embeddings that the models generate for vulnerable and uncertain samples (non-vulnerable, or vulnerability not yet reported), which likely explains the large increase in the quantity and rate of false positives. Additionally, model performance fluctuates with vulnerability characteristics (e.g., vulnerability type and severity). For example, the studied models achieve F1 scores 26 percentage points higher when vulnerabilities are related to information leaks or code injection than when they are related to path resolution or predictable return values. Our results highlight the substantial performance gap that still needs to be bridged before deep learning-based vulnerability detection is ready for deployment in practical settings. We dive deeper into why models underperform in realistic settings, and our investigation reveals overfitting as a key issue. We address this by introducing an augmentation technique that potentially improves performance by up to 30%. We contribute (a) an approach to creating a dataset that future research can use to improve the practicality of model evaluation; (b) Real-Vul, a comprehensive dataset that adheres to this approach; and (c) empirical evidence that deep learning-based models struggle to perform in a real-world setting.
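The sensitivity of precision and F1 to the evaluation set can be illustrated with simple arithmetic. The sketch below is not the paper's evaluation pipeline; the sample counts, the true-positive rate, and the false-positive rate are hypothetical values chosen only to show how a fixed classifier looks on a balanced benchmark versus a codebase-wide sample in which vulnerable functions are rare.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def evaluate(n_vulnerable, n_other, tpr, fpr):
    """Expected metrics for a classifier with fixed per-class rates.

    tpr: fraction of vulnerable samples flagged (hypothetical).
    fpr: fraction of non-vulnerable samples flagged (hypothetical).
    """
    tp = tpr * n_vulnerable
    fn = n_vulnerable - tp
    fp = fpr * n_other
    return precision_recall_f1(tp, fp, fn)


# Hypothetical classifier: flags 90% of vulnerable and 5% of other samples.
TPR, FPR = 0.90, 0.05

# Balanced benchmark: as many vulnerable samples as non-vulnerable ones.
balanced = evaluate(n_vulnerable=1_000, n_other=1_000, tpr=TPR, fpr=FPR)

# Codebase-wide sample: vulnerable functions are 1% of the data.
realistic = evaluate(n_vulnerable=1_000, n_other=99_000, tpr=TPR, fpr=FPR)

for name, (p, r, f1) in (("balanced", balanced), ("realistic", realistic)):
    print(f"{name:9s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these assumed rates, recall is identical in both settings, while precision falls from roughly 0.95 to roughly 0.15 because false positives scale with the much larger pool of non-vulnerable code. This base-rate effect is only one possible contributor; the abstract additionally attributes the drop to overlapping embeddings and overfitting.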




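Source journal

IEEE Transactions on Software Engineering (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 9.70
Self-citation rate: 10.80%
Articles published: 724
Review time: 6 months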

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of the Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include:
a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models.
b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects.
c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards.
d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues.
e) System issues: Hardware-software trade-offs.
f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.
