{"title":"VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching","authors":"Arastoo Zibaeirad, Marco Vieira","doi":"arxiv-2409.10756","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have shown promise in tasks like code\ntranslation, prompting interest in their potential for automating software\nvulnerability detection (SVD) and patching (SVP). To further research in this\narea, establishing a benchmark is essential for evaluating the strengths and\nlimitations of LLMs in these tasks. Despite their capabilities, questions\nremain regarding whether LLMs can accurately analyze complex vulnerabilities\nand generate appropriate patches. This paper introduces VulnLLMEval, a\nframework designed to assess the performance of LLMs in identifying and\npatching vulnerabilities in C code. Our study includes 307 real-world\nvulnerabilities extracted from the Linux kernel, creating a well-curated\ndataset that includes both vulnerable and patched code. This dataset, based on\nreal-world code, provides a diverse and representative testbed for evaluating\nLLM performance in SVD and SVP tasks, offering a robust foundation for rigorous\nassessment. Our results reveal that LLMs often struggle with distinguishing\nbetween vulnerable and patched code. Furthermore, in SVP tasks, these models\ntend to oversimplify the code, producing solutions that may not be directly\nusable without further refinement.","PeriodicalId":501278,"journal":{"name":"arXiv - CS - Software Engineering","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Software Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10756","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Large Language Models (LLMs) have shown promise in tasks like code translation, prompting interest in their potential for automating software vulnerability detection (SVD) and patching (SVP). To further research in this area, establishing a benchmark is essential for evaluating the strengths and limitations of LLMs in these tasks. Despite their capabilities, questions remain regarding whether LLMs can accurately analyze complex vulnerabilities and generate appropriate patches. This paper introduces VulnLLMEval, a framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code. Our study includes 307 real-world vulnerabilities extracted from the Linux kernel, creating a well-curated dataset that includes both vulnerable and patched code. This dataset, based on real-world code, provides a diverse and representative testbed for evaluating LLM performance in SVD and SVP tasks, offering a robust foundation for rigorous assessment. Our results reveal that LLMs often struggle with distinguishing between vulnerable and patched code. Furthermore, in SVP tasks, these models tend to oversimplify the code, producing solutions that may not be directly usable without further refinement.
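
As a rough illustration of how an SVD-style evaluation over paired vulnerable and patched code might be structured (this is a hypothetical sketch, not the authors' VulnLLMEval implementation), the Python snippet below pairs the two versions of each function, asks a model for a yes/no verdict on each, and counts how often the model both flags the vulnerable version and accepts the patched one. The `query_llm` stub and the `VulnPair` record are assumptions introduced here for illustration; any real harness would substitute its own model backend and dataset loader.

```python
# Hypothetical sketch of an SVD-style evaluation loop over vulnerable/patched pairs.
# Not the VulnLLMEval implementation; query_llm is a placeholder for the model under test.

from dataclasses import dataclass


@dataclass
class VulnPair:
    """One benchmark item: a CVE identifier with its vulnerable and patched C code."""
    cve_id: str
    vulnerable_code: str
    patched_code: str


def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the model being evaluated and return its raw answer."""
    raise NotImplementedError("plug in an actual model call here")


def classify(code: str) -> bool:
    """Ask the model whether `code` is vulnerable; treat an answer starting with 'yes' as positive."""
    prompt = (
        "Does the following C function contain a security vulnerability? "
        "Answer 'yes' or 'no'.\n\n" + code
    )
    return query_llm(prompt).strip().lower().startswith("yes")


def evaluate(pairs: list[VulnPair]) -> float:
    """Fraction of pairs where the model distinguishes the two versions:
    it flags the vulnerable code and does not flag the patched code."""
    correct = 0
    for pair in pairs:
        if classify(pair.vulnerable_code) and not classify(pair.patched_code):
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```

Scoring on pairs rather than on isolated samples makes the "struggles to distinguish vulnerable from patched code" finding directly measurable: a model that answers "yes" to everything scores zero under this metric, even though it would look strong on vulnerable samples alone.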