基于NLP方法的二进制和反汇编码作者识别

Inf. Comput. Pub Date : 2023-06-25 DOI:10.3390/info14070361

Aleksandr Romanov, A. Kurtukova, A. Fedotova, A. Shelupanov

{"title":"基于NLP方法的二进制和反汇编码作者识别","authors":"Aleksandr Romanov, A. Kurtukova, A. Fedotova, A. Shelupanov","doi":"10.3390/info14070361","DOIUrl":null,"url":null,"abstract":"This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges.","PeriodicalId":13622,"journal":{"name":"Inf. Comput.","volume":"14 1","pages":"361"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Authorship Identification of Binary and Disassembled Codes Using NLP Methods\",\"authors\":\"Aleksandr Romanov, A. Kurtukova, A. Fedotova, A. Shelupanov\",\"doi\":\"10.3390/info14070361\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges.\",\"PeriodicalId\":13622,\"journal\":{\"name\":\"Inf. Comput.\",\"volume\":\"14 1\",\"pages\":\"361\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Inf. Comput.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/info14070361\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Inf. Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/info14070361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

本文是旨在确定源代码作者身份的系列文章的一部分。分析二进制代码是网络安全、软件开发和计算机取证的一个重要方面，特别是在识别恶意软件作者方面。任何程序都是机器代码，可以使用专门的工具对其进行反汇编，并分析其作者身份，类似于使用自然语言处理方法的自然语言文本。在本研究中，我们提出了一种集成fastText、支持向量机(SVM)和作者在先前工作中开发的混合神经网络的方法。改进后的方法使用从GitHub和Google Code Jam收集的C和c++语言编写的源代码数据集进行评估。收集到的源代码被编译成可执行程序，然后使用逆向工程工具进行反汇编。采用改进的方法对反汇编代码进行作者识别的平均准确率超过0.90。此外，该方法在源代码上进行了测试，在简单情况下达到0.96的平均精度，在复杂情况下达到0.85以上。这些结果验证了所开发方法的有效性及其在解决网络安全挑战方面的适用性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Authorship Identification of Binary and Disassembled Codes Using NLP Methods

This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Inf. Comput.

自引率

0.00%

发文量

期刊最新文献

Traceable Constant-Size Multi-authority Credentials Pspace-Completeness of the Temporal Logic of Sub-Intervals and Suffixes Employee Productivity Assessment Using Fuzzy Inference System Correction of Threshold Determination in Rapid-Guessing Behaviour Detection Combining Classifiers for Deep Learning Mask Face Recognition