Vul-LMGNNs: Fusing language models and online-distilled graph neural networks for code vulnerability detection

IF 14.7 | Zone 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Information Fusion | Pub Date: 2024-10-21 | DOI: 10.1016/j.inffus.2024.102748

Ruitong Liu, Yanbin Wang, Haitao Xu, Jianguo Sun, Fan Zhang, Peiyue Li, Zhenhao Guo

Information Fusion, Volume 115, Article 102748. Journal Article. https://www.sciencedirect.com/science/article/pii/S1566253524005268

Citations: 0

Abstract

Code Language Models (CodeLMs) and Graph Neural Networks (GNNs) are widely used in code vulnerability detection. However, a critical yet often overlooked issue is that GNNs primarily rely on aggregating information from adjacent nodes, limiting structural information transfer to single-layer updates. In code graphs, nodes and relationships typically require cross-layer information propagation to fully capture complex program logic and potential vulnerability patterns. Furthermore, while some studies utilize CodeLMs to supplement GNNs with code semantic information, existing integration methods have not fully explored the potential of their collaborative effects.
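As an illustration of the adjacency limitation described above, the following minimal sketch runs plain mean aggregation on a toy path graph. It is a generic message-passing example, not the gated GNN used in the paper; the graph, the feature values, and the `aggregate` helper are all hypothetical.

```python
# Toy path graph 0 - 1 - 2 - 3 as an undirected adjacency list (hypothetical data).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def aggregate(features, graph):
    """One round of mean aggregation over each node and its direct neighbours."""
    updated = {}
    for node, neighbours in graph.items():
        values = [features[node]] + [features[n] for n in neighbours]
        updated[node] = sum(values) / len(values)
    return updated


# Only node 0 carries a signal at the start.
features = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}

one_round = aggregate(features, graph)
three_rounds = aggregate(aggregate(one_round, graph), graph)

# After one round the signal has reached only node 1 (the direct neighbour);
# nodes 2 and 3 are still at zero. Node 3 sees it only after three rounds.
print(one_round)
print(three_rounds)
```

In this toy setting, each round moves information exactly one hop, so distant nodes influence each other only through repeated updates.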

To address these challenges, we introduce Vul-LMGNNs, which integrates pre-trained CodeLMs with GNNs, leveraging knowledge distillation to facilitate cross-layer propagation of both code semantic knowledge and structural information. Specifically, Vul-LMGNNs utilizes Code Property Graphs (CPGs) to incorporate code syntax, control flow, and data dependencies, while employing gated GNNs to extract structural information in the CPG. To achieve cross-layer information transmission, we implement an online knowledge distillation (KD) procedure that enables a single student GNN to acquire structural information extracted from a simultaneously trained counterpart through an alternating training procedure. Additionally, we leverage pre-trained CodeLMs to extract semantic features from code sequences. Finally, we propose an "implicit-explicit" joint training framework to better leverage the strengths of both CodeLMs and GNNs. In the implicit phase, we utilize CodeLMs to initialize the node embeddings of each student GNN. Through online knowledge distillation, we facilitate the propagation of both code semantics and structural information across layers. In the explicit phase, we perform linear interpolation between the CodeLM and the distilled GNN to learn a late fusion model. The proposed method, evaluated across four real-world vulnerability datasets, demonstrated superior performance compared to 17 state-of-the-art approaches. Our source code can be accessed via GitHub: https://github.com/Vul-LMGNN/vul-LMGGNN.
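The two numeric ingredients named in the abstract, a distillation signal between peer GNNs and a linear interpolation between CodeLM and GNN outputs, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss or say which quantities are interpolated, so the temperature-softened KL term, the choice to interpolate class probabilities, the weight `lam`, and all function names and logits here are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distillation_loss(student_logits, peer_logits, temperature=2.0):
    """Assumed KD term: the student matches the peer's softened predictions."""
    peer = softmax(peer_logits, temperature)
    student = softmax(student_logits, temperature)
    return kl_divergence(peer, student)


def fuse(lm_probs, gnn_probs, lam):
    """Assumed late fusion: lam weights the GNN, (1 - lam) weights the CodeLM."""
    return [lam * g + (1.0 - lam) * l for l, g in zip(lm_probs, gnn_probs)]


# Hypothetical logits for one function over the classes [benign, vulnerable].
lm_probs = softmax([0.2, 1.1])
gnn_probs = softmax([1.5, 0.3])
fused = fuse(lm_probs, gnn_probs, lam=0.5)
kd = distillation_loss([1.5, 0.3], [1.2, 0.6])

print(fused)
print(kd)
```

With `lam=0` the fused prediction reduces to the CodeLM alone and with `lam=1` to the GNN alone, so `lam` can be treated as a tunable trade-off between semantic and structural evidence. In an alternating scheme, each peer GNN would take turns acting as the student in `distillation_loss` while the other provides the target.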




Source journal

Information Fusion (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 33.20
Self-citation rate: 4.30%
Publication volume: 161
Review time: 7.9 months

Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.