Ruitong Liu , Yanbin Wang , Haitao Xu , Jianguo Sun , Fan Zhang , Peiyue Li , Zhenhao Guo
{"title":"Vul-LMGNNs: Fusing language models and online-distilled graph neural networks for code vulnerability detection","authors":"Ruitong Liu , Yanbin Wang , Haitao Xu , Jianguo Sun , Fan Zhang , Peiyue Li , Zhenhao Guo","doi":"10.1016/j.inffus.2024.102748","DOIUrl":null,"url":null,"abstract":"<div><div>Code Language Models (codeLMs) and Graph Neural Networks (GNNs) are widely used in code vulnerability detection. However, a critical yet often overlooked issue is that GNNs primarily rely on aggregating information from adjacent nodes, limiting structural information transfer to single-layer updates. In code graphs, nodes and relationships typically require cross-layer information propagation to fully capture complex program logic and potential vulnerability patterns. Furthermore, while some studies utilize codeLMs to supplement GNNs with code semantic information, existing integration methods have not fully explored the potential of their collaborative effects.</div><div>To address these challenges, we introduce Vul-LMGNNs that integrates pre-trained CodeLMs with GNNs, leveraging knowledge distillation to facilitate cross-layer propagation of both code semantic knowledge and structural information. Specifically, Vul-LMGNNs utilizes Code Property Graphs (CPGs) to incorporate code syntax, control flow, and data dependencies, while employing gated GNNs to extract structural information in the CPG. To achieve cross-layer information transmission, we implement an online knowledge distillation (KD) program that enables a single student GNN to acquire structural information extracted from a simultaneously trained counterpart through an alternating training procedure. Additionally, we leverage pre-trained CodeLMs to extract semantic features from code sequences. Finally, we propose an ”implicit-explicit” joint training framework to better leverage the strengths of both CodeLMs and GNNs. In the implicit phase, we utilize CodeLMs to initialize the node embeddings of each student GNN. Through online knowledge distillation, we facilitate the propagation of both code semantics and structural information across layers. In the explicit phase, we perform linear interpolation between the CodeLM and the distilled GNN to learn a late fusion model. The proposed method, evaluated across four real-world vulnerability datasets, demonstrated superior performance compared to 17 state-of-the-art approaches. Our source code can be accessed via GitHub: <span><span>https://github.com/Vul-LMGNN/vul-LMGGNN</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":null,"pages":null},"PeriodicalIF":14.7000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253524005268","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Code Language Models (codeLMs) and Graph Neural Networks (GNNs) are widely used in code vulnerability detection. However, a critical yet often overlooked issue is that GNNs primarily rely on aggregating information from adjacent nodes, limiting structural information transfer to single-layer updates. In code graphs, nodes and relationships typically require cross-layer information propagation to fully capture complex program logic and potential vulnerability patterns. Furthermore, while some studies utilize codeLMs to supplement GNNs with code semantic information, existing integration methods have not fully explored the potential of their collaborative effects.
To address these challenges, we introduce Vul-LMGNNs that integrates pre-trained CodeLMs with GNNs, leveraging knowledge distillation to facilitate cross-layer propagation of both code semantic knowledge and structural information. Specifically, Vul-LMGNNs utilizes Code Property Graphs (CPGs) to incorporate code syntax, control flow, and data dependencies, while employing gated GNNs to extract structural information in the CPG. To achieve cross-layer information transmission, we implement an online knowledge distillation (KD) program that enables a single student GNN to acquire structural information extracted from a simultaneously trained counterpart through an alternating training procedure. Additionally, we leverage pre-trained CodeLMs to extract semantic features from code sequences. Finally, we propose an ”implicit-explicit” joint training framework to better leverage the strengths of both CodeLMs and GNNs. In the implicit phase, we utilize CodeLMs to initialize the node embeddings of each student GNN. Through online knowledge distillation, we facilitate the propagation of both code semantics and structural information across layers. In the explicit phase, we perform linear interpolation between the CodeLM and the distilled GNN to learn a late fusion model. The proposed method, evaluated across four real-world vulnerability datasets, demonstrated superior performance compared to 17 state-of-the-art approaches. Our source code can be accessed via GitHub: https://github.com/Vul-LMGNN/vul-LMGGNN.
期刊介绍:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.