汇编代码的分布式表示

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Computers Pub Date : 2023-11-01 DOI:10.3390/computers12110222
Kazuki Yoshida, Kaiyu Suzuki, Tomofumi Matsuzawa
{"title":"汇编代码的分布式表示","authors":"Kazuki Yoshida, Kaiyu Suzuki, Tomofumi Matsuzawa","doi":"10.3390/computers12110222","DOIUrl":null,"url":null,"abstract":"In recent years, the number of similar software products with many common parts has been increasing due to the reuse and plagiarism of source code in the software development process. Pattern matching, which is an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify similarities among the arge amounts of code present in software products in advance can reduce these oads. In this paper, we propose a representation earning model to extract feature expressions from assembly code obtained by statically analyzing such code to determine the similarity between software products. We use assembly code to eliminate the dependence on the existence of source code or differences in development anguage. The proposed approach makes use of Asm2Vec, an existing method, that is capable of generating a vector representation that captures the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented by graph data. Thus, we use graph embedding, a graph vector representation method, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated expression vectors from multiple programs and used clustering to verify the accuracy of the approach in classifying similar programs into the same cluster. The proposed method outperforms existing methods that only consider semantics in both accuracy and execution time.","PeriodicalId":46292,"journal":{"name":"Computers","volume":"44 3-4","pages":"0"},"PeriodicalIF":2.6000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distributed Representation for Assembly Code\",\"authors\":\"Kazuki Yoshida, Kaiyu Suzuki, Tomofumi Matsuzawa\",\"doi\":\"10.3390/computers12110222\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, the number of similar software products with many common parts has been increasing due to the reuse and plagiarism of source code in the software development process. Pattern matching, which is an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify similarities among the arge amounts of code present in software products in advance can reduce these oads. In this paper, we propose a representation earning model to extract feature expressions from assembly code obtained by statically analyzing such code to determine the similarity between software products. We use assembly code to eliminate the dependence on the existence of source code or differences in development anguage. The proposed approach makes use of Asm2Vec, an existing method, that is capable of generating a vector representation that captures the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented by graph data. Thus, we use graph embedding, a graph vector representation method, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated expression vectors from multiple programs and used clustering to verify the accuracy of the approach in classifying similar programs into the same cluster. The proposed method outperforms existing methods that only consider semantics in both accuracy and execution time.\",\"PeriodicalId\":46292,\"journal\":{\"name\":\"Computers\",\"volume\":\"44 3-4\",\"pages\":\"0\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/computers12110222\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/computers12110222","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

摘要

近年来,由于软件开发过程中源代码的重用和抄袭,类似的软件产品中有许多共性部件的数量不断增加。模式匹配是现有的一种相似度检测方法,但无法检测出这些软件产品与其他程序之间的相似度。例如,有必要根据功能和控制结构的共性来检测相似性。同时,详细的软件分析需要人工逆向工程。因此,提前自动识别软件产品中大量代码之间的相似性的技术可以减少这些负担。本文提出了一种表示学习模型,从静态分析得到的汇编代码中提取特征表达式,以确定软件产品之间的相似度。我们使用汇编代码来消除对源代码存在的依赖或开发语言的差异。所建议的方法利用了Asm2Vec(一种现有的方法),该方法能够生成捕获汇编代码语义的矢量表示。所提出的方法还包含了程序控制结构的信息。控制结构可以用图形数据表示。因此,我们使用图嵌入(一种图向量表示方法)来生成反映汇编代码的语义和控制结构的表示向量。在我们的实验中,我们从多个程序中生成表达向量,并使用聚类来验证该方法将相似程序分类到同一聚类中的准确性。该方法在精度和执行时间上都优于仅考虑语义的现有方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Distributed Representation for Assembly Code
In recent years, the number of similar software products with many common parts has been increasing due to the reuse and plagiarism of source code in the software development process. Pattern matching, which is an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify similarities among the arge amounts of code present in software products in advance can reduce these oads. In this paper, we propose a representation earning model to extract feature expressions from assembly code obtained by statically analyzing such code to determine the similarity between software products. We use assembly code to eliminate the dependence on the existence of source code or differences in development anguage. The proposed approach makes use of Asm2Vec, an existing method, that is capable of generating a vector representation that captures the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented by graph data. Thus, we use graph embedding, a graph vector representation method, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated expression vectors from multiple programs and used clustering to verify the accuracy of the approach in classifying similar programs into the same cluster. The proposed method outperforms existing methods that only consider semantics in both accuracy and execution time.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Computers
Computers COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS-
CiteScore
5.40
自引率
3.60%
发文量
153
审稿时长
11 weeks
期刊最新文献
Advanced Road Safety: Collective Perception for Probability of Collision Estimation of Connected Vehicles Forecasting of Bitcoin Illiquidity Using High-Dimensional and Textual Features Mining Negative Associations from Medical Databases Considering Frequent, Regular, Closed and Maximal Patterns Faraway, so Close: Perceptions of the Metaverse on the Edge of Madness Blockchain-Powered Gaming: Bridging Entertainment with Serious Game Objectives
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1