一种用于漏洞检测的扩展词嵌入方法基准系统

Proceedings of the 4th International Conference on Future Networks and Distributed Systems Pub Date : 2020-11-26 DOI:10.1145/3440749.3442661

H. Nguyen, Hoang Nguyen Viet, T. Uehara

{"title":"一种用于漏洞检测的扩展词嵌入方法基准系统","authors":"H. Nguyen, Hoang Nguyen Viet, T. Uehara","doi":"10.1145/3440749.3442661","DOIUrl":null,"url":null,"abstract":"Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.","PeriodicalId":344578,"journal":{"name":"Proceedings of the 4th International Conference on Future Networks and Distributed Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"An Extended Benchmark System of Word Embedding Methods for Vulnerability Detection\",\"authors\":\"H. Nguyen, Hoang Nguyen Viet, T. Uehara\",\"doi\":\"10.1145/3440749.3442661\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.\",\"PeriodicalId\":344578,\"journal\":{\"name\":\"Proceedings of the 4th International Conference on Future Networks and Distributed Systems\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Conference on Future Networks and Distributed Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3440749.3442661\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Conference on Future Networks and Distributed Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3440749.3442661","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

安全研究人员已经使用自然语言处理(NLP)和深度学习技术来编程代码分析任务，如自动错误检测和漏洞预测或分类。这些研究主要是基于自然语言处理嵌入方法生成深度学习模型的输入向量。然而，尽管有许多现有的嵌入方法，神经网络的结构是多种多样的，通常是启发式的。这使得选择神经模型和嵌入技术的有效组合来训练代码漏洞检测器变得困难。为了解决这一挑战，我们扩展了一个基准系统来分析四种流行的词嵌入技术与四种不同神经网络的兼容性，包括标准的双向长短期记忆(Bi-LSTM)、Bi-LSTM应用注意机制、卷积神经网络(CNN)和经典的深度神经网络(DNN)。我们使用用C代码编写的两种易受攻击的函数数据集来训练和测试模型。结果表明，结合FastText嵌入技术的Bi-LSTM模型在真实数据集上的检测率最高，而在人工构建的数据集上的检测率则不高。我们的结果还详细讨论了与其他组合的进一步比较。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

An Extended Benchmark System of Word Embedding Methods for Vulnerability Detection

Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 4th International Conference on Future Networks and Distributed Systems

自引率

0.00%

发文量