SCL-CVD: Supervised contrastive learning for code vulnerability detection via GraphCodeBERT

Computers & Security · IF 4.8 · CAS Region 2 (Computer Science) · JCR Q1, Computer Science, Information Systems · Pub Date: 2024-07-15 · DOI: 10.1016/j.cose.2024.103994
{"title":"SCL-CVD: Supervised contrastive learning for code vulnerability detection via GraphCodeBERT","authors":"","doi":"10.1016/j.cose.2024.103994","DOIUrl":null,"url":null,"abstract":"<div><p>Detecting vulnerabilities in source code is crucial for protecting software systems from cyberattacks. Pre-trained language models such as CodeBERT and GraphCodeBERT have been applied in multiple code-related downstream tasks such as code search and code translation and have achieved notable success. Recently, this pre-trained and fine-tuned paradigm has also been applied to detect code vulnerabilities. However, fine-tuning pre-trained language models using cross-entropy loss has several limitations, such as poor generalization performance and lack of robustness to noisy labels. In particular, when the vulnerable code and the benign code are very similar, it is difficult for deep learning methods to differentiate them accurately. In this context, we introduce a novel approach for code vulnerability detection using supervised contrastive learning, namely SCL-CVD, which leverages GraphCodeBERT. This method aims to enhance the effectiveness of existing vulnerable code detection approaches. SCL-CVD represents the source code as data flow graphs. These graphs are then processed by GraphCodeBERT, which has been fine-tuned using a supervised contrastive loss function combined with R-Drop. This fine-tuning process is designed to generate more resilient and representative code embedding. Additionally, we incorporate LoRA (Low-Rank Adaptation) to streamline the fine-tuning process, significantly reducing the time required for model training. Finally, a Multilayer Perceptron (MLP) is employed to detect vulnerable code leveraging the learned representation of code. We designed and conducted experiments on three public benchmark datasets, i.e., Devign, Reveal, Big-Vul, and a combined dataset created by merging these sources. The experimental results demonstrate that SCL-CVD can effectively improve the performance of code vulnerability detection. Compared with the baselines, the proposed approach has a relative improvement of 0.48%<span><math><mo>∼</mo></math></span>3.42% for accuracy, 0.93%<span><math><mo>∼</mo></math></span>45.99% for precision, 35.68%<span><math><mo>∼</mo></math></span>67.48% for recall, and 16.31%<span><math><mo>∼</mo></math></span>49.67% for F1-score, respectively. Furthermore, compared to baselines, the model fine-tuning time of the proposed approach is reduced by 16.67%<span><math><mo>∼</mo></math></span>93.03%. In conclusion, our approach SCL-CVD offers significantly greater cost-effectiveness over existing approaches.</p></div>","PeriodicalId":51004,"journal":{"name":"Computers & Security","volume":null,"pages":null},"PeriodicalIF":4.8000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers & Security","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167404824002992","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Detecting vulnerabilities in source code is crucial for protecting software systems from cyberattacks. Pre-trained language models such as CodeBERT and GraphCodeBERT have been applied to multiple code-related downstream tasks, such as code search and code translation, and have achieved notable success. Recently, this pre-train-and-fine-tune paradigm has also been applied to code vulnerability detection. However, fine-tuning pre-trained language models with a cross-entropy loss has several limitations, such as poor generalization and a lack of robustness to noisy labels. In particular, when vulnerable code and benign code are very similar, deep learning methods struggle to differentiate them accurately. In this context, we introduce SCL-CVD, a novel approach for code vulnerability detection based on supervised contrastive learning that leverages GraphCodeBERT and aims to improve the effectiveness of existing vulnerable-code detection approaches. SCL-CVD represents source code as data flow graphs, which are processed by GraphCodeBERT fine-tuned with a supervised contrastive loss combined with R-Drop; this fine-tuning is designed to produce more robust and representative code embeddings. Additionally, we incorporate LoRA (Low-Rank Adaptation) to streamline fine-tuning, significantly reducing the time required for model training. Finally, a multilayer perceptron (MLP) detects vulnerable code from the learned code representations. We designed and conducted experiments on three public benchmark datasets, i.e., Devign, Reveal, and Big-Vul, as well as a combined dataset created by merging these sources. The experimental results demonstrate that SCL-CVD can effectively improve the performance of code vulnerability detection. Compared with the baselines, the proposed approach achieves relative improvements of 0.48%∼3.42% in accuracy, 0.93%∼45.99% in precision, 35.68%∼67.48% in recall, and 16.31%∼49.67% in F1-score, respectively. Furthermore, compared to the baselines, the model fine-tuning time of the proposed approach is reduced by 16.67%∼93.03%. In conclusion, SCL-CVD offers significantly greater cost-effectiveness than existing approaches.
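The abstract names the ingredients of the training objective (supervised contrastive loss, R-Drop, LoRA, an MLP head over GraphCodeBERT embeddings) but not their implementation. The following is a minimal sketch, not the authors' code, of how these pieces could fit together in PyTorch with Hugging Face transformers and peft. Only the checkpoint name microsoft/graphcodebert-base is a real public artifact; the class and function names, loss weights, LoRA hyperparameters, and MLP sizes are illustrative assumptions, and the paper's data-flow-graph input construction is omitted here for brevity.

```python
# Sketch of the SCL-CVD-style objective described in the abstract (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Supervised contrastive (SupCon-style) loss over L2-normalized batch embeddings."""
    features = F.normalize(features, dim=1)
    sim = features @ features.T / temperature                      # pairwise similarities
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_count).mean()

class SCLCVDSketch(nn.Module):
    def __init__(self, base="microsoft/graphcodebert-base"):
        super().__init__()
        encoder = AutoModel.from_pretrained(base)
        lora = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
        self.encoder = get_peft_model(encoder, lora)                # only LoRA adapters are trainable
        self.classifier = nn.Sequential(                            # MLP detection head (sizes illustrative)
            nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emb = hidden[:, 0]                                          # pooled [CLS]-position embedding
        return emb, self.classifier(emb)

def training_step(model, batch, alpha=0.1, beta=1.0):
    # Two forward passes with dropout active form the R-Drop consistency pair.
    emb1, logits1 = model(batch["input_ids"], batch["attention_mask"])
    emb2, logits2 = model(batch["input_ids"], batch["attention_mask"])
    ce = F.cross_entropy(logits1, batch["labels"]) + F.cross_entropy(logits2, batch["labels"])
    kl = F.kl_div(F.log_softmax(logits1, -1), F.softmax(logits2, -1), reduction="batchmean") \
       + F.kl_div(F.log_softmax(logits2, -1), F.softmax(logits1, -1), reduction="batchmean")
    scl = supervised_contrastive_loss(emb1, batch["labels"])
    return ce + alpha * kl + beta * scl                             # loss weights are placeholders
```

In this sketch the contrastive term pulls embeddings of same-label functions together and pushes vulnerable and benign embeddings apart, which is the mechanism the abstract credits for separating near-identical vulnerable and benign code, while LoRA keeps the number of trainable encoder parameters small, consistent with the reported reduction in fine-tuning time.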

Source journal: Computers & Security (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 12.40 · Self-citation rate: 7.10% · Articles per year: 365 · Review time: 10.7 months
Journal description: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides a unique blend of leading-edge research and sound practical management advice. It is aimed at professionals involved with computer security, audit, control, and data integrity in all sectors: industry, commerce, and academia. Recognized worldwide as the primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.