Joseph R. Barr, Peter Shaw, F. Abu-Khzam, Tyler Thatcher, Sheng Yu
Int. J. Semantic Comput. Journal Article, published 2020-12-01. DOI: 10.1142/S1793351X20500087 (https://doi.org/10.1142/S1793351X20500087)
Vulnerability Rating of Source Code with Token Embedding and Combinatorial Algorithms
We present an empirical analysis of the source code of the Fluoride Bluetooth module, which is part of the standard Android OS distribution, exhibiting a novel approach to classifying and scoring source code and rating its vulnerability. Our workflow combines deep learning, combinatorial optimization, heuristics and machine learning. A combination of heuristics and deep learning is used to embed function (and method) labels into a low-dimensional Euclidean space. Because the corpus of the Fluoride source code is rather limited (containing approximately 12,000 functions), a straightforward embedding (using, e.g., code2vec) is untenable. To overcome the dearth of data, it is necessary to go through an intermediate step of Byte-Pair Encoding. Subsequently, we embed the tokens, from which we assemble an embedding of function/method labels. A long short-term memory (LSTM) network is used to embed the tokens. The next step is to form a distance matrix consisting of the cosines between every pair of vectors (function embeddings), which in turn is interpreted as a (combinatorial) graph whose vertices represent functions and whose edges correspond to entries whose values exceed a given threshold. Cluster-Editing is then applied to partition the vertex set of the graph into subsets representing “dense graphs,” that is, nearly complete subgraphs. Finally, the vectors representing the components, together with additional heuristic-based features, are used to model the vulnerability risk of the components.
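The graph-construction and Cluster-Editing steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy 2-D unit vectors stand in for function embeddings, the threshold of 0.9 is arbitrary, and the helper names (`cosine`, `threshold_graph`, `editing_cost`) are hypothetical. The sketch does not solve Cluster Editing, which is NP-hard in general; it only counts the edge insertions and deletions that a given candidate partition would require.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threshold_graph(vectors, threshold):
    """Edges (i, j) with i < j for every pair whose cosine exceeds the threshold."""
    return {
        (i, j)
        for i, j in combinations(range(len(vectors)), 2)
        if cosine(vectors[i], vectors[j]) > threshold
    }


def editing_cost(n, edges, clusters):
    """Edge insertions plus deletions needed to turn the graph into a
    disjoint union of cliques, one clique per cluster."""
    label = {v: c for c, members in enumerate(clusters) for v in members}
    cost = 0
    for i, j in combinations(range(n), 2):
        same_cluster = label[i] == label[j]
        has_edge = (i, j) in edges
        if same_cluster != has_edge:  # missing inner edge or surplus outer edge
            cost += 1
    return cost


def unit(degrees):
    """Hypothetical helper: a 2-D unit vector at the given angle."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))


# Toy "function embeddings" (illustrative values only).
vectors = [unit(0), unit(20), unit(40), unit(90), unit(100)]
edges = threshold_graph(vectors, threshold=0.9)

# Vertices 0-1-2 form a path, so making {0, 1, 2} a clique needs one insertion.
merged_cost = editing_cost(len(vectors), edges, [{0, 1, 2}, {3, 4}])
# Splitting vertex 2 off instead needs one deletion.
split_cost = editing_cost(len(vectors), edges, [{0, 1}, {2}, {3, 4}])
print(sorted(edges), merged_cost, split_cost)
```

In this toy graph both candidate partitions cost a single edit, which shows why a "nearly complete" subgraph can be treated as one component: a cluster is accepted when only a few edits separate it from a clique.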