隐藏代码漏洞检测:图-BiLSTM 算法研究

IF 3.8 2区 计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Information and Software Technology Pub Date : 2024-07-30 DOI:10.1016/j.infsof.2024.107544
Kao Ge, Qing-Bang Han
{"title":"隐藏代码漏洞检测:图-BiLSTM 算法研究","authors":"Kao Ge,&nbsp;Qing-Bang Han","doi":"10.1016/j.infsof.2024.107544","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><p>The accelerated growth of the Internet and the advent of artificial intelligence have led to a heightened interdependence of open source products, which has in turn resulted in a rise in the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open source software products has become an urgent challenge for both academic and engineering communities.</p></div><div><h3>Objectives:</h3><p>In response to this pressing need, a novel and efficient code vulnerability detection model has been proposed: the Graph-Bi-Directional Long Short-Term Memory Network Algorithm (Graph-BiLSTM). The algorithm is designed to enable the detection of vulnerabilities in Github’s code commit records on a large scale, at low cost and in an efficient manner.</p></div><div><h3>Methods:</h3><p>In order to extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were compared in order to identify the optimal training dataset. Initially, the Joern tool was employed to transform function-level code blocks into Code Property Graphs (CPGs). Thereafter, structural features (degree centrality, Katz centrality, and closeness centrality) of these CPGs were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. Subsequently, the BiLSTM network algorithm was employed for the automated extraction and iterative model training of a substantial number of vulnerability code samples. Finally, the trained algorithmic model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.</p></div><div><h3>Conclusion:</h3><p>Experimental results indicate that the PrimeVul dataset represents the most optimal resource for vulnerability detection. Moreover, the Graph-BiLSTM model demonstrated superior performance in terms of accuracy, training cost, and inference time when compared to state-of-the-art algorithms for the detection of vulnerabilities in open-source software code on GitHub. This highlights the significant value of the model for engineering applications.</p></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"175 ","pages":"Article 107544"},"PeriodicalIF":3.8000,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Hidden code vulnerability detection: A study of the Graph-BiLSTM algorithm\",\"authors\":\"Kao Ge,&nbsp;Qing-Bang Han\",\"doi\":\"10.1016/j.infsof.2024.107544\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Context:</h3><p>The accelerated growth of the Internet and the advent of artificial intelligence have led to a heightened interdependence of open source products, which has in turn resulted in a rise in the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open source software products has become an urgent challenge for both academic and engineering communities.</p></div><div><h3>Objectives:</h3><p>In response to this pressing need, a novel and efficient code vulnerability detection model has been proposed: the Graph-Bi-Directional Long Short-Term Memory Network Algorithm (Graph-BiLSTM). The algorithm is designed to enable the detection of vulnerabilities in Github’s code commit records on a large scale, at low cost and in an efficient manner.</p></div><div><h3>Methods:</h3><p>In order to extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were compared in order to identify the optimal training dataset. Initially, the Joern tool was employed to transform function-level code blocks into Code Property Graphs (CPGs). Thereafter, structural features (degree centrality, Katz centrality, and closeness centrality) of these CPGs were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. Subsequently, the BiLSTM network algorithm was employed for the automated extraction and iterative model training of a substantial number of vulnerability code samples. Finally, the trained algorithmic model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.</p></div><div><h3>Conclusion:</h3><p>Experimental results indicate that the PrimeVul dataset represents the most optimal resource for vulnerability detection. Moreover, the Graph-BiLSTM model demonstrated superior performance in terms of accuracy, training cost, and inference time when compared to state-of-the-art algorithms for the detection of vulnerabilities in open-source software code on GitHub. This highlights the significant value of the model for engineering applications.</p></div>\",\"PeriodicalId\":54983,\"journal\":{\"name\":\"Information and Software Technology\",\"volume\":\"175 \",\"pages\":\"Article 107544\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2024-07-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information and Software Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950584924001496\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950584924001496","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

摘要

互联网的加速发展和人工智能的出现导致开源产品之间的相互依赖性增强,进而导致安全事件的频率上升。因此,如何经济、快速、高效地检测开源软件产品中隐藏的代码漏洞,已成为学术界和工程界面临的一项紧迫挑战。针对这一迫切需求,我们提出了一种新颖高效的代码漏洞检测模型:图形-双向长短期记忆网络算法(Graph-BiLSTM)。该算法旨在大规模、低成本、高效率地检测 Github 代码提交记录中的漏洞。为了提取最有效的代码漏洞特征,对最先进的漏洞数据集进行了比较,以确定最佳训练数据集。首先,使用 Joern 工具将函数级代码块转换为代码属性图(CPG)。之后,计算这些 CPG 的结构特征(度中心性、卡茨中心性和接近中心性),并将其与节点序列的嵌入特征相结合,形成功能级代码块的二维特征向量空间。随后,采用 BiLSTM 网络算法对大量漏洞代码样本进行自动提取和迭代模型训练。最后,将训练好的算法模型应用于 GitHub 上开源软件产品的代码提交记录,实现了对隐藏代码漏洞的有效检测。实验结果表明,PrimeVul 数据集是漏洞检测的最佳资源。此外,在检测 GitHub 上开源软件代码中的漏洞时,与最先进的算法相比,Graph-BiLSTM 模型在准确性、训练成本和推理时间方面都表现出了卓越的性能。这凸显了该模型在工程应用中的重要价值。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Hidden code vulnerability detection: A study of the Graph-BiLSTM algorithm

Context:

The accelerated growth of the Internet and the advent of artificial intelligence have led to a heightened interdependence of open source products, which has in turn resulted in a rise in the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open source software products has become an urgent challenge for both academic and engineering communities.

Objectives:

In response to this pressing need, a novel and efficient code vulnerability detection model has been proposed: the Graph-Bi-Directional Long Short-Term Memory Network Algorithm (Graph-BiLSTM). The algorithm is designed to enable the detection of vulnerabilities in Github’s code commit records on a large scale, at low cost and in an efficient manner.

Methods:

In order to extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were compared in order to identify the optimal training dataset. Initially, the Joern tool was employed to transform function-level code blocks into Code Property Graphs (CPGs). Thereafter, structural features (degree centrality, Katz centrality, and closeness centrality) of these CPGs were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. Subsequently, the BiLSTM network algorithm was employed for the automated extraction and iterative model training of a substantial number of vulnerability code samples. Finally, the trained algorithmic model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.

Conclusion:

Experimental results indicate that the PrimeVul dataset represents the most optimal resource for vulnerability detection. Moreover, the Graph-BiLSTM model demonstrated superior performance in terms of accuracy, training cost, and inference time when compared to state-of-the-art algorithms for the detection of vulnerabilities in open-source software code on GitHub. This highlights the significant value of the model for engineering applications.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Information and Software Technology
Information and Software Technology 工程技术-计算机:软件工程
CiteScore
9.10
自引率
7.70%
发文量
164
审稿时长
9.6 weeks
期刊介绍: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal''s scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include: • Software management, quality and metrics, • Software processes, • Software architecture, modelling, specification, design and programming • Functional and non-functional software requirements • Software testing and verification & validation • Empirical studies of all aspects of engineering and managing software development Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.
期刊最新文献
Editorial Board A software product line approach for developing hybrid software systems Systematic mapping study on requirements engineering for regulatory compliance of software systems Evaluating the understandability and user acceptance of Attack-Defense Trees: Original experiment and replication Who uses personas in requirements engineering: The practitioners’ perspective
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1