SCL-CVD: Supervised contrastive learning for code vulnerability detection via GraphCodeBERT

Computers & Security · IF 4.8 · CAS Region 2 (Computer Science) · JCR Q1, Computer Science, Information Systems · Pub Date: 2024-07-15 · DOI: 10.1016/j.cose.2024.103994
{"title":"SCL-CVD: Supervised contrastive learning for code vulnerability detection via GraphCodeBERT","authors":"","doi":"10.1016/j.cose.2024.103994","DOIUrl":null,"url":null,"abstract":"<div><p>Detecting vulnerabilities in source code is crucial for protecting software systems from cyberattacks. Pre-trained language models such as CodeBERT and GraphCodeBERT have been applied in multiple code-related downstream tasks such as code search and code translation and have achieved notable success. Recently, this pre-trained and fine-tuned paradigm has also been applied to detect code vulnerabilities. However, fine-tuning pre-trained language models using cross-entropy loss has several limitations, such as poor generalization performance and lack of robustness to noisy labels. In particular, when the vulnerable code and the benign code are very similar, it is difficult for deep learning methods to differentiate them accurately. In this context, we introduce a novel approach for code vulnerability detection using supervised contrastive learning, namely SCL-CVD, which leverages GraphCodeBERT. This method aims to enhance the effectiveness of existing vulnerable code detection approaches. SCL-CVD represents the source code as data flow graphs. These graphs are then processed by GraphCodeBERT, which has been fine-tuned using a supervised contrastive loss function combined with R-Drop. This fine-tuning process is designed to generate more resilient and representative code embedding. Additionally, we incorporate LoRA (Low-Rank Adaptation) to streamline the fine-tuning process, significantly reducing the time required for model training. Finally, a Multilayer Perceptron (MLP) is employed to detect vulnerable code leveraging the learned representation of code. We designed and conducted experiments on three public benchmark datasets, i.e., Devign, Reveal, Big-Vul, and a combined dataset created by merging these sources. The experimental results demonstrate that SCL-CVD can effectively improve the performance of code vulnerability detection. Compared with the baselines, the proposed approach has a relative improvement of 0.48%<span><math><mo>∼</mo></math></span>3.42% for accuracy, 0.93%<span><math><mo>∼</mo></math></span>45.99% for precision, 35.68%<span><math><mo>∼</mo></math></span>67.48% for recall, and 16.31%<span><math><mo>∼</mo></math></span>49.67% for F1-score, respectively. Furthermore, compared to baselines, the model fine-tuning time of the proposed approach is reduced by 16.67%<span><math><mo>∼</mo></math></span>93.03%. In conclusion, our approach SCL-CVD offers significantly greater cost-effectiveness over existing approaches.</p></div>","PeriodicalId":51004,"journal":{"name":"Computers & Security","volume":null,"pages":null},"PeriodicalIF":4.8000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Security","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167404824002992","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Detecting vulnerabilities in source code is crucial for protecting software systems from cyberattacks. Pre-trained language models such as CodeBERT and GraphCodeBERT have been applied to multiple code-related downstream tasks, such as code search and code translation, and have achieved notable success. Recently, this pre-train-and-fine-tune paradigm has also been applied to code vulnerability detection. However, fine-tuning pre-trained language models with a cross-entropy loss has several limitations, such as poor generalization and a lack of robustness to noisy labels. In particular, when vulnerable code and benign code are very similar, deep learning methods struggle to differentiate them accurately. In this context, we introduce SCL-CVD, a novel approach for code vulnerability detection based on supervised contrastive learning that leverages GraphCodeBERT and aims to improve the effectiveness of existing vulnerable-code detection approaches. SCL-CVD represents source code as data flow graphs, which are processed by GraphCodeBERT fine-tuned with a supervised contrastive loss combined with R-Drop; this fine-tuning is designed to produce more robust and representative code embeddings. Additionally, we incorporate LoRA (Low-Rank Adaptation) to streamline fine-tuning, significantly reducing the time required for model training. Finally, a multilayer perceptron (MLP) detects vulnerable code from the learned code representations. We designed and conducted experiments on three public benchmark datasets, i.e., Devign, Reveal, and Big-Vul, as well as a combined dataset created by merging these sources. The experimental results demonstrate that SCL-CVD can effectively improve the performance of code vulnerability detection. Compared with the baselines, the proposed approach achieves relative improvements of 0.48%∼3.42% in accuracy, 0.93%∼45.99% in precision, 35.68%∼67.48% in recall, and 16.31%∼49.67% in F1-score, respectively. Furthermore, compared to the baselines, the model fine-tuning time of the proposed approach is reduced by 16.67%∼93.03%. In conclusion, SCL-CVD offers significantly greater cost-effectiveness than existing approaches.
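The abstract names the ingredients of the training objective (supervised contrastive loss, R-Drop, LoRA, an MLP head over GraphCodeBERT embeddings) but not their implementation. The following is a minimal sketch, not the authors' code, of how these pieces could fit together in PyTorch with Hugging Face transformers and peft. Only the checkpoint name microsoft/graphcodebert-base is a real public artifact; the class and function names, loss weights, LoRA hyperparameters, and MLP sizes are illustrative assumptions, and the paper's data-flow-graph input construction is omitted here for brevity.

```python
# Sketch of the SCL-CVD-style objective described in the abstract (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Supervised contrastive (SupCon-style) loss over L2-normalized batch embeddings."""
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                      # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()

class SCLCVDSketch(nn.Module):
    def __init__(self, base="microsoft/graphcodebert-base"):
        super().__init__()
        encoder = AutoModel.from_pretrained(base)
        lora = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
        self.encoder = get_peft_model(encoder, lora)                # only LoRA adapters are trainable
        self.classifier = nn.Sequential(                            # MLP detection head (sizes illustrative)
            nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emb = hidden[:, 0]                                          # pooled [CLS]-position embedding
        return emb, self.classifier(emb)

def training_step(model, batch, alpha=0.1, beta=1.0):
    # Two forward passes with dropout active form the R-Drop consistency pair.
    emb1, logits1 = model(batch["input_ids"], batch["attention_mask"])
    emb2, logits2 = model(batch["input_ids"], batch["attention_mask"])
    ce = F.cross_entropy(logits1, batch["labels"]) + F.cross_entropy(logits2, batch["labels"])
    kl = F.kl_div(F.log_softmax(logits1, -1), F.softmax(logits2, -1), reduction="batchmean") \
       + F.kl_div(F.log_softmax(logits2, -1), F.softmax(logits1, -1), reduction="batchmean")
    scl = supervised_contrastive_loss(emb1, batch["labels"])
    return ce + alpha * kl + beta * scl                             # loss weights are placeholders
```

In this sketch the contrastive term pulls embeddings of same-label functions together and pushes vulnerable and benign embeddings apart, which is the mechanism the abstract credits for separating near-identical vulnerable and benign code, while LoRA keeps the number of trainable encoder parameters small, consistent with the reported reduction in fine-tuning time.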

Source journal: Computers & Security (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 12.40 · Self-citation rate: 7.10% · Articles per year: 365 · Review time: 10.7 months
Journal description: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides a unique blend of leading-edge research and sound practical management advice. It is aimed at professionals involved with computer security, audit, control, and data integrity in all sectors: industry, commerce, and academia. Recognized worldwide as the primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.