Can An Old Fashioned Feature Extraction and A Light-weight Model Improve Vulnerability Type Identification Performance?

Inf. Softw. Technol. | Pub Date: 2023-06-26 | DOI: 10.48550/arXiv.2306.14726

H. Vo, Son Nguyen


Citations: 2

Abstract


Recent advances in automated vulnerability detection have achieved promising results in helping developers identify vulnerable components. However, once a vulnerability has been detected, investigating and fixing the vulnerable code is a non-trivial task. Knowing the type of a vulnerability, such as buffer overflow or memory corruption, can help developers quickly understand the nature of the weakness and localize the vulnerability for security analysis. In this work, we investigate the problem of vulnerability type identification (VTI). The problem is modeled as a multi-label classification task, which could be effectively addressed by the "pre-training, then fine-tuning" framework with deep pre-trained embedding models. We evaluate the performance of well-known and advanced pre-trained models for VTI on a large set of vulnerabilities. Surprisingly, their performance is not much better than that of a classical baseline approach using an old-fashioned bag-of-words representation, TF-IDF. Meanwhile, these deep neural network approaches consume far more resources and require GPUs. We also introduce a lightweight independent component to refine the predictions of the baseline approach. Our idea is that vulnerability types could strongly correlate with certain code tokens (distinguishing tokens) in several crucial parts of programs. The distinguishing tokens for each vulnerability type are statistically identified based on their prevalence in that type versus the other types. Our results show that the baseline approach enhanced by our component can outperform the state-of-the-art deep pre-trained approaches while retaining very high efficiency. Furthermore, the proposed component could also improve the neural network approaches by up to 92.8% in macro-average F1.
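The idea of picking distinguishing tokens by their prevalence in one type versus the others can be illustrated with a small sketch. The abstract does not give the exact statistic, so the smoothed prevalence ratio below is an assumption, as are the function name `distinguishing_tokens`, its parameters, the `use_after_free` label, and the toy token lists. It is not the authors' implementation.

```python
from collections import Counter, defaultdict


def distinguishing_tokens(samples, top_k=3, smoothing=1.0):
    """Rank tokens per vulnerability type by an assumed prevalence ratio.

    samples: list of (tokens, labels) pairs, where labels is a set of type
    names (multi-label). The score is the smoothed fraction of samples of
    the type that contain the token, divided by the smoothed fraction of
    all other samples that contain it.
    """
    in_counts = defaultdict(Counter)  # type -> token -> samples of that type containing it
    type_sizes = Counter()            # type -> number of samples carrying that label
    all_counts = Counter()            # token -> number of samples containing it

    for tokens, labels in samples:
        unique = set(tokens)          # count each token once per sample
        all_counts.update(unique)
        for label in labels:
            type_sizes[label] += 1
            in_counts[label].update(unique)

    total = len(samples)
    ranked = {}
    for label, counts in in_counts.items():
        n_in = type_sizes[label]
        n_out = total - n_in
        scored = []
        for token, c_in in counts.items():
            c_out = all_counts[token] - c_in
            p_in = (c_in + smoothing) / (n_in + 2 * smoothing)
            p_out = (c_out + smoothing) / (n_out + 2 * smoothing)
            scored.append((token, p_in / p_out))
        # Highest ratio first; ties broken alphabetically for determinism.
        scored.sort(key=lambda item: (-item[1], item[0]))
        ranked[label] = scored[:top_k]
    return ranked


# Hypothetical toy data: tokenized code fragments with their type labels.
samples = [
    (["memcpy", "buf", "len"], {"buffer_overflow"}),
    (["strcpy", "buf", "src"], {"buffer_overflow"}),
    (["free", "ptr", "len"], {"use_after_free"}),
    (["free", "ptr", "buf"], {"use_after_free"}),
]

tokens_by_type = distinguishing_tokens(samples)
for vuln_type, tokens in sorted(tokens_by_type.items()):
    print(vuln_type, tokens)
```

In this sketch a token that appears in most samples of one type and rarely elsewhere gets a ratio well above 1, while a token spread evenly across types stays near 1. How such scores would be combined with the TF-IDF baseline's predictions is not described in the abstract and is left out here.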


Source journal

Inf. Softw. Technol.

Self-citation rate

0.00%

Publication count