Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories

IF 6.6 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, SOFTWARE ENGINEERING · ACM Transactions on Software Engineering and Methodology · Pub Date: 2024-03-04 · DOI: 10.1145/3649590
Daan Hommersom, Antonino Sabetta, Bonaventura Coppola, Dario Di Nucci, Damian A. Tamburri

Abstract

The lack of comprehensive sources of accurate vulnerability data represents a critical obstacle to studying and understanding software vulnerabilities (and their corrections). In this paper, we present an approach that combines heuristics stemming from practical experience with machine learning (ML), specifically natural language processing (NLP), to address this problem. Our method consists of three phases. First, we construct an advisory record object containing key information about a vulnerability that is extracted from an advisory, such as those found in the National Vulnerability Database (NVD). These advisories are expressed in natural language. Second, using heuristics, a subset of candidate fix commits is obtained from the source code repository of the affected project by filtering out commits that can be identified as unrelated to the vulnerability at hand. Finally, for each of the remaining candidate commits, our method builds a numerical feature vector reflecting the characteristics of the commit that are relevant to predicting its match with the advisory at hand. Based on the values of these feature vectors, our method produces a ranked list of candidate fix commits. The score attributed by the ML model to each feature is kept visible to the users, allowing them to easily interpret the predictions.
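The three phases described above can be pictured with a minimal Python sketch. It is not the authors' implementation: the class and function names, the token-overlap features, the date-window filter, and the hand-set weights are all assumptions made for illustration, whereas the paper relies on NLP and a trained ML model.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath

TOKEN = re.compile(r"[a-z][a-z0-9_]{3,}")  # crude stand-in for real NLP

@dataclass
class AdvisoryRecord:              # hypothetical shape of an "advisory record"
    vuln_id: str
    description: str
    published_day: int             # assumed unit: days on some shared timeline
    keywords: set = field(default_factory=set)

@dataclass
class Commit:                      # hypothetical, simplified commit
    sha: str
    message: str
    day: int
    changed_files: list

def build_advisory_record(vuln_id, description, published_day):
    """Phase 1: extract key information from the natural-language advisory."""
    keywords = set(TOKEN.findall(description.lower()))
    return AdvisoryRecord(vuln_id, description, published_day, keywords)

def filter_candidates(record, commits, window_days=180):
    """Phase 2: heuristic filter; here, drop commits far from the advisory date."""
    return [c for c in commits if abs(c.day - record.published_day) <= window_days]

def feature_vector(record, commit):
    """Phase 3a: numerical features describing how well a commit matches."""
    msg = commit.message.lower()
    msg_tokens = set(TOKEN.findall(msg))
    stems = {PurePosixPath(p).stem.lower() for p in commit.changed_files}
    return {
        "mentions_vuln_id": float(record.vuln_id.lower() in msg),
        "keyword_overlap": len(record.keywords & msg_tokens) / max(len(record.keywords), 1),
        "file_named_in_advisory": float(bool(stems & record.keywords)),
    }

# Hand-set weights for illustration only; the paper learns them with an ML model.
WEIGHTS = {"mentions_vuln_id": 3.0, "keyword_overlap": 2.0, "file_named_in_advisory": 1.0}

def rank(record, commits):
    """Phase 3b: score candidates and keep per-feature contributions visible."""
    scored = []
    for c in filter_candidates(record, commits):
        contributions = {k: WEIGHTS[k] * v for k, v in feature_vector(record, c).items()}
        scored.append((sum(contributions.values()), c.sha, contributions))
    return sorted(scored, key=lambda item: item[0], reverse=True)

# Made-up example data (not from the paper's data set).
record = build_advisory_record(
    "CVE-0000-0001",
    "Path traversal in the upload handler allows reading arbitrary files",
    published_day=1000,
)
commits = [
    Commit("aaa", "Update README", 1001, ["README.md"]),
    Commit("bbb", "Fix path traversal in upload handler (CVE-0000-0001)", 1003, ["src/upload.py"]),
    Commit("ccc", "Fix path traversal", 100, ["src/upload.py"]),  # outside the window
]
ranked = rank(record, commits)
for score, sha, contributions in ranked:
    print(sha, score, contributions)
```

Returning the per-feature contributions next to the total score mirrors the interpretability goal stated in the abstract: a reviewer can see why a commit was ranked where it was.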

We implemented our approach and evaluated it on an open data set, built by manual curation, that comprises 2,391 known fix commits corresponding to 1,248 public vulnerability advisories. When considering the top-10 commits in the ranked results, our implementation could successfully identify at least one fix commit for up to 84.03% of the vulnerabilities (with a fix commit in the first position for 65.06% of the vulnerabilities). Our evaluation shows that our method can considerably reduce the manual effort needed to search OSS repositories for the commits that fix known vulnerabilities.
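The "at least one fix commit in the top-k" figure reported above is a hit rate over advisories. A minimal sketch of how such a metric could be computed follows. The function name `hit_rate_at_k` and the toy data are assumptions for illustration, not the authors' evaluation code, and the numbers do not come from the paper's data set.

```python
def hit_rate_at_k(ranked_lists, known_fixes, k):
    """Share of advisories with at least one known fix commit among the top-k results.

    ranked_lists: advisory id -> list of commit ids, best first
    known_fixes:  advisory id -> set of commit ids known to fix it
    """
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for advisory_id, ranked in ranked_lists.items()
        if set(ranked[:k]) & known_fixes.get(advisory_id, set())
    )
    return hits / len(ranked_lists)

# Made-up rankings and ground truth for four advisories.
toy_rankings = {
    "ADV-1": ["a1", "a2", "a3"],
    "ADV-2": ["b1", "b2", "b3"],
    "ADV-3": ["c1", "c2"],
    "ADV-4": ["d1", "d2"],
}
toy_fixes = {
    "ADV-1": {"a1"},
    "ADV-2": {"b3"},
    "ADV-3": {"c1", "c9"},
    "ADV-4": {"d7"},
}

print(hit_rate_at_k(toy_rankings, toy_fixes, k=1))   # 0.5
print(hit_rate_at_k(toy_rankings, toy_fixes, k=10))  # 0.75
```

Because an advisory can have several fix commits (2,391 commits for 1,248 advisories in the data set described above), the metric counts an advisory as a hit as soon as any one of its fix commits appears within the cutoff.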

Source journal
ACM Transactions on Software Engineering and Methodology (Engineering and Technology: Computer Science, Software Engineering)
CiteScore: 6.30
Self-citation rate: 4.50%
Articles published: 164
Review time: >12 weeks
Journal description: Designing and building a large, complex software system is a tremendous challenge. ACM Transactions on Software Engineering and Methodology (TOSEM) publishes papers on all aspects of that challenge: specification, design, development and maintenance. It covers tools and methodologies, languages, data structures, and algorithms. TOSEM also reports on successful efforts, noting practical lessons that can be scaled and transferred to other projects, and often looks at applications of innovative technologies. The tone is scholarly but readable; the content is worthy of study; the presentation is effective.