Vulnerability Rating of Source Code with Token Embedding and Combinatorial Algorithms

Int. J. Semantic Comput. Pub Date: 2020-12-01. DOI: 10.1142/S1793351X20500087

Joseph R. Barr, Peter Shaw, F. Abu-Khzam, Tyler Thatcher, Sheng Yu


Citations: 4

Abstract

We present an empirical analysis of the source code of the Fluoride Bluetooth module, which is part of the standard Android OS distribution, and use it to exhibit a novel approach to classifying and scoring source code and rating its vulnerability. Our workflow combines deep learning, combinatorial optimization, heuristics and machine learning. A combination of heuristics and deep learning is used to embed function (and method) labels into a low-dimensional Euclidean space. Because the corpus of Fluoride source code is rather limited (approximately 12,000 functions), a straightforward embedding (using, e.g., code2vec) is untenable. To overcome this dearth of data, we go through an intermediate step of Byte-Pair Encoding. We then embed the resulting tokens, using a long short-term memory (LSTM) network, and assemble the token embeddings into embeddings of the function/method labels. The next step is to form a distance matrix consisting of the cosines between every pair of vectors (function embeddings), which in turn is interpreted as a (combinatorial) graph whose vertices represent functions and whose edges correspond to entries whose values exceed some given threshold. Cluster-Editing is then applied to partition the vertex set of the graph into subsets representing “dense graphs,” that is, nearly complete subgraphs. Finally, the vectors representing the components, together with additional heuristic-based features, are used as features to model the components for vulnerability risk.
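The graph-construction step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the toy two-dimensional vectors, the 0.95 threshold and all function names are assumptions made for the example. Connected components stand in as a naive candidate partition; the paper applies a Cluster-Editing algorithm at this point, and the sketch only counts how many edge edits a given partition would require.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def threshold_graph(similarity, threshold):
    """Boolean adjacency matrix: edge (i, j) iff similarity[i, j] > threshold and i != j."""
    adjacency = similarity > threshold
    np.fill_diagonal(adjacency, False)
    return adjacency

def connected_components(adjacency):
    """Vertex sets of the connected components, found by depth-first search."""
    seen, components = set(), []
    for start in range(adjacency.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in np.flatnonzero(adjacency[v]):
                w = int(w)
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components

def editing_cost(adjacency, clusters):
    """Edge insertions plus deletions needed to turn `clusters` into disjoint cliques."""
    label = np.empty(adjacency.shape[0], dtype=int)
    for k, cluster in enumerate(clusters):
        label[cluster] = k
    same_cluster = label[:, None] == label[None, :]
    np.fill_diagonal(same_cluster, False)
    # The matrices are symmetric, so every differing pair is counted twice.
    return int(np.sum(adjacency != same_cluster)) // 2

# Hypothetical function embeddings (one row per function).
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])
graph = threshold_graph(cosine_matrix(embeddings), threshold=0.95)
clusters = connected_components(graph)
print(clusters, editing_cost(graph, clusters))
```

In this toy input the first three vectors form a path rather than a triangle, so treating them as one cluster costs one edge insertion. That is the sense in which a cluster is a "nearly complete" subgraph: a Cluster-Editing solver searches for the partition that minimizes this edit count.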
