{"title":"Hidden code vulnerability detection: A study of the Graph-BiLSTM algorithm","authors":"Kao Ge, Qing-Bang Han","doi":"10.1016/j.infsof.2024.107544","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><p>The accelerated growth of the Internet and the advent of artificial intelligence have led to a heightened interdependence of open source products, which has in turn resulted in a rise in the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open source software products has become an urgent challenge for both academic and engineering communities.</p></div><div><h3>Objectives:</h3><p>In response to this pressing need, a novel and efficient code vulnerability detection model has been proposed: the Graph-Bi-Directional Long Short-Term Memory Network Algorithm (Graph-BiLSTM). The algorithm is designed to enable the detection of vulnerabilities in Github’s code commit records on a large scale, at low cost and in an efficient manner.</p></div><div><h3>Methods:</h3><p>In order to extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were compared in order to identify the optimal training dataset. Initially, the Joern tool was employed to transform function-level code blocks into Code Property Graphs (CPGs). Thereafter, structural features (degree centrality, Katz centrality, and closeness centrality) of these CPGs were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. Subsequently, the BiLSTM network algorithm was employed for the automated extraction and iterative model training of a substantial number of vulnerability code samples. Finally, the trained algorithmic model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.</p></div><div><h3>Conclusion:</h3><p>Experimental results indicate that the PrimeVul dataset represents the most optimal resource for vulnerability detection. Moreover, the Graph-BiLSTM model demonstrated superior performance in terms of accuracy, training cost, and inference time when compared to state-of-the-art algorithms for the detection of vulnerabilities in open-source software code on GitHub. This highlights the significant value of the model for engineering applications.</p></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"175 ","pages":"Article 107544"},"PeriodicalIF":3.8000,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950584924001496","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Context:
The accelerated growth of the Internet and the advent of artificial intelligence have led to a heightened interdependence of open source products, which has in turn resulted in a rise in the frequency of security incidents. Consequently, the cost-effective, fast and efficient detection of hidden code vulnerabilities in open source software products has become an urgent challenge for both academic and engineering communities.
Objectives:
In response to this pressing need, a novel and efficient code vulnerability detection model has been proposed: the Graph-Bi-Directional Long Short-Term Memory Network Algorithm (Graph-BiLSTM). The algorithm is designed to enable the detection of vulnerabilities in Github’s code commit records on a large scale, at low cost and in an efficient manner.
Methods:
In order to extract the most effective code vulnerability features, state-of-the-art vulnerability datasets were compared in order to identify the optimal training dataset. Initially, the Joern tool was employed to transform function-level code blocks into Code Property Graphs (CPGs). Thereafter, structural features (degree centrality, Katz centrality, and closeness centrality) of these CPGs were computed and combined with the embedding features of the node sequences to form a two-dimensional feature vector space for the function-level code blocks. Subsequently, the BiLSTM network algorithm was employed for the automated extraction and iterative model training of a substantial number of vulnerability code samples. Finally, the trained algorithmic model was applied to code commit records of open-source software products on GitHub, achieving effective detection of hidden code vulnerabilities.
Conclusion:
Experimental results indicate that the PrimeVul dataset represents the most optimal resource for vulnerability detection. Moreover, the Graph-BiLSTM model demonstrated superior performance in terms of accuracy, training cost, and inference time when compared to state-of-the-art algorithms for the detection of vulnerabilities in open-source software code on GitHub. This highlights the significant value of the model for engineering applications.
期刊介绍:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal''s scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics,
• Software processes,
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premiere outlet for systematic literature studies in software engineering.