Zhenzhou Tian, Yuchen Gong, Chenhao Chang, Jiaze Sun, Yanping Chen, Lingwei Chen
Journal of Computer Languages, volume 81, Article 101289. Published 2024-08-05. DOI: 10.1016/j.cola.2024.101289. URL: https://www.sciencedirect.com/science/article/pii/S2590118424000327
An empirical study on divergence of differently-sourced LLVM IRs
Many approaches to binary code similarity detection operate on a unified intermediate representation (IR), such as the Low Level Virtual Machine (LLVM) IR, to overcome the cross-architecture analysis challenge posed by the significant morphological and syntactic gaps between instruction set architectures (ISAs). However, the LLVM IR of a given program can be affected by diverse factors, including its acquisition source: it may be compiled from source code, or disassembled and lifted from binary code. While the impact of compilation settings on binary code has been explored, the specific differences between LLVM IRs obtained from different sources remain underexamined. To this end, we conduct a pioneering, in-depth empirical study to assess the discrepancies between LLVM IRs derived from different sources. We curate an extensive dataset of nearly 98 million LLVM IR instructions across 808,431 functions, organized with respect to these potential IR-influencing factors. On this basis, we devise three types of code metrics, covering the syntactic, structural, and semantic aspects of the IR samples, and use them to assess the divergence of IRs across origins. The findings offer insights into how, and to what extent, the various factors affect the IRs. They provide guidance for assembling a training corpus for robust LLVM IR-oriented pre-training models, and they support program analysis studies that operate on LLVM IR.