Asm2Seq: Explainable Assembly Code Functional Summary Generation for Reverse Engineering and Vulnerability Analysis

Digital Threats: Research and Practice, Pub Date: 2023-05-18, DOI: 10.1145/3592623

Scarlett Taviss, Steven H. H. Ding, Mohammad Zulkernine, P. Charland, Sudipta Acharya

Citations: 0

Abstract

Reverse engineering is the process of understanding the inner workings of a software system without access to its source code. It is critical for firmware security validation, software vulnerability research, and malware analysis, but it often requires a significant amount of manual effort. Recently, data-driven solutions were proposed to reduce that effort by identifying code clones at the assembly or source level. However, security analysts still have to read the matched assembly or source code to understand its functionality, and these solutions assume that a matching candidate always exists. This research bridges the gap by introducing the problem of assembly code summarization. Given assembly code as input, we propose a machine-learning-based system that produces human-readable summaries of its functionality in the context of code vulnerability analysis. We generate the first assembly-code-to-function-summary dataset and propose to leverage the encoder-decoder architecture. With the attention mechanism, it is possible to understand which aspects of the assembly code had the largest impact on generating the summary. Our experiment shows that the proposed solution achieves high accuracy and a high Bilingual Evaluation Understudy (BLEU) score. Finally, we have performed case studies on real-life CVE vulnerability cases to better understand the proposed method's performance and practical implications.
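The abstract's point about attention can be illustrated with a minimal sketch. It is not the authors' model: it uses plain dot-product attention, and the token list, the 2-dimensional encoder states, and the decoder query are all made-up values chosen for illustration. It only shows how attention weights over input tokens can be ranked to see which tokens contributed most to one generated summary word.

```python
import math


def attention_weights(query, keys):
    """Softmax over dot-product scores between one decoder query and each encoder key."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max before exponentiating, for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_tokens(tokens, weights, k=2):
    """Return the k (token, weight) pairs with the largest attention weight."""
    ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy data: four assembly tokens with made-up 2-d encoder states.
tokens = ["mov", "call", "strcpy", "ret"]
keys = [[0.1, 0.0], [0.5, 0.2], [1.0, 0.9], [0.0, 0.1]]
query = [1.0, 1.0]  # made-up decoder state for one summary word

weights = attention_weights(query, keys)
for token, weight in top_tokens(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

In this toy setup the weights sum to one, so each can be read as the share of attention a token received for that output word. A real system would learn the encoder and decoder states rather than hard-code them.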
