{"title":"一种用于漏洞检测的扩展词嵌入方法基准系统","authors":"H. Nguyen, Hoang Nguyen Viet, T. Uehara","doi":"10.1145/3440749.3442661","DOIUrl":null,"url":null,"abstract":"Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.","PeriodicalId":344578,"journal":{"name":"Proceedings of the 4th International Conference on Future Networks and Distributed Systems","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"An Extended Benchmark System of Word Embedding Methods for Vulnerability Detection\",\"authors\":\"H. Nguyen, Hoang Nguyen Viet, T. Uehara\",\"doi\":\"10.1145/3440749.3442661\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.\",\"PeriodicalId\":344578,\"journal\":{\"name\":\"Proceedings of the 4th International Conference on Future Networks and Distributed Systems\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 4th International Conference on Future Networks and Distributed Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3440749.3442661\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 4th International Conference on Future Networks and Distributed Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3440749.3442661","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Extended Benchmark System of Word Embedding Methods for Vulnerability Detection
Security researchers have used Natural Language Processing (NLP) and Deep Learning techniques for programming code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models based on the NLP embedding methods. Nevertheless, while there are many existing embedding methods, the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and the embedding techniques for training the code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks, including the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM applied attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models by using two types of vulnerable function datasets written in C code. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world but not on an artificially constructed dataset. Further comparisons with the other combinations are also discussed in detail in our result.