CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification

Zeqing Qin, Yiwei Wu, Lansheng Han
{"title":"CLNX: Bridging Code and Natural Language for C/C++ Vulnerability-Contributing Commits Identification","authors":"Zeqing Qin, Yiwei Wu, Lansheng Han","doi":"arxiv-2409.07407","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have shown great promise in vulnerability\nidentification. As C/C++ comprises half of the Open-Source Software (OSS)\nvulnerabilities over the past decade and updates in OSS mainly occur through\ncommits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing\nCommits (VCCs) is essential. However, current studies primarily focus on\nfurther pre-training LLMs on massive code datasets, which is resource-intensive\nand poses efficiency challenges. In this paper, we enhance the ability of\nBERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose\nCodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++\nprograms and LLMs. Based on commits, CLNX efficiently converts the source code\ninto a more natural representation while preserving key details. Specifically,\nCLNX first applies structure-level naturalization to decompose complex\nprograms, followed by token-level naturalization to interpret complex symbols.\nWe evaluate CLNX on public datasets of 25,872 C/C++ functions with their\ncommits. The results show that CLNX significantly enhances the performance of\nLLMs on identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new\nstate-of-the-art and identifies 38 OSS vulnerabilities in the real world.","PeriodicalId":501332,"journal":{"name":"arXiv - CS - Cryptography and Security","volume":"7 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Cryptography and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07407","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Large Language Models (LLMs) have shown great promise in vulnerability identification. As C/C++ comprises half of the Open-Source Software (OSS) vulnerabilities over the past decade and updates in OSS mainly occur through commits, enhancing LLMs' ability to identify C/C++ Vulnerability-Contributing Commits (VCCs) is essential. However, current studies primarily focus on further pre-training LLMs on massive code datasets, which is resource-intensive and poses efficiency challenges. In this paper, we enhance the ability of BERT-based LLMs to identify C/C++ VCCs in a lightweight manner. We propose CodeLinguaNexus (CLNX) as a bridge facilitating communication between C/C++ programs and LLMs. Based on commits, CLNX efficiently converts the source code into a more natural representation while preserving key details. Specifically, CLNX first applies structure-level naturalization to decompose complex programs, followed by token-level naturalization to interpret complex symbols. We evaluate CLNX on public datasets of 25,872 C/C++ functions with their commits. The results show that CLNX significantly enhances the performance of LLMs on identifying C/C++ VCCs. Moreover, CLNX-equipped CodeBERT achieves new state-of-the-art and identifies 38 OSS vulnerabilities in the real world.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
CLNX:为识别 C/C++ 漏洞贡献提交架起代码与自然语言的桥梁
大型语言模型(LLMs)在漏洞识别方面显示了巨大的前景。在过去十年中,C/C++ 占据了开源软件(OSS)漏洞的半壁江山,而开源软件的更新主要是通过提交来实现的,因此提高 LLM 识别 C/C++ 漏洞提交(VCC)的能力至关重要。然而,目前的研究主要侧重于在海量代码数据集上对 LLM 进行进一步的预训练,这不仅耗费大量资源,而且在效率方面也存在挑战。在本文中,我们以轻量级的方式增强了基于 BERT 的 LLM 识别 C/C++ VCC 的能力。我们提出了代码语言联系(CodeLinguaNexus,CLNX)作为促进 C/C++ 程序与 LLM 之间交流的桥梁。基于提交,CLNX 可以高效地将源代码转换为更自然的表示形式,同时保留关键细节。具体来说,CLNX首先应用结构级归化来分解复杂程序,然后应用标记级归化来解释复杂符号。我们在包含25872个C/C++函数及其提交的公开数据集上对CLNX进行了评估。结果表明,CLNX 显著提高了LLMs 识别 C/C++ VCC 的性能。此外,配备 CLNX 的 CodeBERT 达到了最新水平,在现实世界中识别出了 38 个开放源码软件漏洞。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
PAD-FT: A Lightweight Defense for Backdoor Attacks via Data Purification and Fine-Tuning Artemis: Efficient Commit-and-Prove SNARKs for zkML A Survey-Based Quantitative Analysis of Stress Factors and Their Impacts Among Cybersecurity Professionals Log2graphs: An Unsupervised Framework for Log Anomaly Detection with Efficient Feature Extraction Practical Investigation on the Distinguishability of Longa's Atomic Patterns
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1