Hidden code vulnerability detection: A study of the Graph-BiLSTM algorithm

IF 3.8; Zone 2 (Computer Science); Q2 (COMPUTER SCIENCE, INFORMATION SYSTEMS); Information and Software Technology; Pub Date: 2024-07-30; DOI: 10.1016/j.infsof.2024.107544

Kao Ge, Qing-Bang Han

Information and Software Technology, volume 175, Article 107544. https://www.sciencedirect.com/science/article/pii/S0950584924001496

Citations: 0

Abstract

Context:

The accelerated growth of the Internet and the advent of artificial intelligence have increased the interdependence of open-source products, which in turn has raised the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open-source software products has become an urgent challenge for both the academic and engineering communities.

Objectives:

In response to this need, the study proposes a novel and efficient code vulnerability detection model: the Graph Bi-Directional Long Short-Term Memory network algorithm (Graph-BiLSTM). The algorithm is designed to detect vulnerabilities in GitHub code commit records at large scale, at low cost, and efficiently.

Methods:

To extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were first compared to identify the best training dataset. The Joern tool was then used to transform function-level code blocks into Code Property Graphs (CPGs). Next, structural features of these CPGs (degree centrality, Katz centrality, and closeness centrality) were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. A BiLSTM network was then used for automated extraction and iterative model training over a large number of vulnerable code samples. Finally, the trained model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.
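The feature-construction step described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the textbook definitions of the three centralities on a toy undirected graph rather than a real Joern CPG, and the graph, the 2-d node embeddings, the `alpha` and `beta` values, and every function name are hypothetical choices, not taken from the paper.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(number of reachable nodes) / (sum of BFS shortest-path distances)."""
    result = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[src] = (len(dist) - 1) / total if total else 0.0
    return result


def katz_centrality(adj, alpha=0.1, beta=1.0, iters=100):
    """Fixed-point iteration x = alpha * A x + beta (left unnormalized).

    Converges only when alpha < 1 / (largest eigenvalue of A); alpha=0.1 is
    an assumed value that is safe for the small toy graph below.
    """
    x = {v: 0.0 for v in adj}
    for _ in range(iters):
        x = {v: alpha * sum(x[w] for w in adj[v]) + beta for v in adj}
    return x


def node_feature_matrix(adj, embeddings, order):
    """One row per node in `order`: embedding followed by the 3 centralities."""
    deg = degree_centrality(adj)
    katz = katz_centrality(adj)
    close = closeness_centrality(adj)
    return [list(embeddings[v]) + [deg[v], katz[v], close[v]] for v in order]


# Toy stand-in for a CPG: undirected adjacency lists (assumption, not Joern output).
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}

# Hypothetical 2-d node embeddings; a real pipeline would learn or look these up.
embeddings = {0: (0.2, 0.9), 1: (0.5, 0.1), 2: (0.3, 0.3), 3: (0.8, 0.4), 4: (0.1, 0.7)}

# Assumed node-sequence order for the function.
features = node_feature_matrix(adj, embeddings, order=[0, 1, 2, 3, 4])

for row in features:
    print([round(value, 3) for value in row])
```

Each row corresponds to one node in the sequence, so `features` is a (sequence length x feature width) matrix, which is the general shape of input a BiLSTM consumes. How the paper actually orders nodes, embeds them, and normalizes the centralities is not stated in the abstract.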

Conclusion:

Experimental results indicate that, of the datasets compared, PrimeVul is the best resource for vulnerability detection. Moreover, the Graph-BiLSTM model outperformed state-of-the-art algorithms in accuracy, training cost, and inference time when detecting vulnerabilities in open-source software code on GitHub. This highlights the value of the model for engineering applications.

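Source journal

Information and Software Technology (Engineering & Technology; Computer Science: Software Engineering)

CiteScore: 9.10

Self-citation rate: 7.70%

Articles published: 164

Review time: 9.6 weeks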

About the journal: Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:

• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development

Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "negative" results and much more. Read the Guide for authors for more information. The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.

Latest articles from this journal

• A systematic literature review of agile software development projects
• Dynamic information utilization for securing Ethereum smart contracts: A literature review
• A software vulnerability detection method based on multi-modality with unified processing
• Editorial Board
• Fairness-aware practices from developers’ perspective: A survey