A compiler-level intermediate representation based binary analysis and rewriting system

K. Anand, M. Smithson, Khaled Elwazeer, A. Kotha, Jim Gruen, Nathan Giles, R. Barua
{"title":"A compiler-level intermediate representation based binary analysis and rewriting system","authors":"K. Anand, M. Smithson, Khaled Elwazeer, A. Kotha, Jim Gruen, Nathan Giles, R. Barua","doi":"10.1145/2465351.2465380","DOIUrl":null,"url":null,"abstract":"This paper presents component techniques essential for converting executables to a high-level intermediate representation (IR) of an existing compiler. The compiler IR is then employed for three distinct applications: binary rewriting using the compiler's binary back-end, vulnerability detection using source-level symbolic execution, and source-code recovery using the compiler's C backend. Our techniques enable complex high-level transformations not possible in existing binary systems, address a major challenge of input-derived memory addresses in symbolic execution and are the first to enable recovery of a fully functional source-code.\n We present techniques to segment the flat address space in an executable containing undifferentiated blocks of memory. We demonstrate the inadequacy of existing variable identification methods for their promotion to symbols and present our methods for symbol promotion. We also present methods to convert the physically addressed stack in an executable (with a stack pointer) to an abstract stack (without a stack pointer). Our methods do not use symbolic, relocation, or debug information since these are usually absent in deployed executables.\n We have integrated our techniques with a prototype x86 binary framework called SecondWrite that uses LLVM as IR. The robustness of the framework is demonstrated by handling executables totaling more than a million lines of source-code, produced by two different compilers (gcc and Microsoft Visual Studio compiler), three languages (C, C++, and Fortran), two operating systems (Windows and Linux) and a real world program (Apache server).","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"13 1","pages":"295-308"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"89","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Eleventh European Conference on Computer Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2465351.2465380","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 89

Abstract

This paper presents component techniques essential for converting executables to a high-level intermediate representation (IR) of an existing compiler. The compiler IR is then employed for three distinct applications: binary rewriting using the compiler's binary back-end, vulnerability detection using source-level symbolic execution, and source-code recovery using the compiler's C backend. Our techniques enable complex high-level transformations not possible in existing binary systems, address a major challenge of input-derived memory addresses in symbolic execution and are the first to enable recovery of a fully functional source-code. We present techniques to segment the flat address space in an executable containing undifferentiated blocks of memory. We demonstrate the inadequacy of existing variable identification methods for their promotion to symbols and present our methods for symbol promotion. We also present methods to convert the physically addressed stack in an executable (with a stack pointer) to an abstract stack (without a stack pointer). Our methods do not use symbolic, relocation, or debug information since these are usually absent in deployed executables. We have integrated our techniques with a prototype x86 binary framework called SecondWrite that uses LLVM as IR. The robustness of the framework is demonstrated by handling executables totaling more than a million lines of source-code, produced by two different compilers (gcc and Microsoft Visual Studio compiler), three languages (C, C++, and Fortran), two operating systems (Windows and Linux) and a real world program (Apache server).
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
一个基于编译器级中间表示的二进制分析和重写系统
本文介绍了将可执行文件转换为现有编译器的高级中间表示(IR)所必需的组件技术。然后将编译器IR用于三个不同的应用程序:使用编译器的二进制后端进行二进制重写,使用源代码级符号执行进行漏洞检测,以及使用编译器的C后端进行源代码恢复。我们的技术实现了现有二进制系统中不可能实现的复杂高级转换,解决了符号执行中输入派生内存地址的主要挑战,并且是第一个实现完整功能源代码恢复的技术。我们提出了在包含未分化内存块的可执行文件中分割平面地址空间的技术。我们论证了现有变量识别方法在符号推广方面的不足,并提出了我们的符号推广方法。我们还提供了将可执行文件(有堆栈指针)中的物理寻址堆栈转换为抽象堆栈(没有堆栈指针)的方法。我们的方法不使用符号、重定位或调试信息,因为在部署的可执行文件中通常没有这些信息。我们将我们的技术与一个名为SecondWrite的原型x86二进制框架集成在一起,该框架使用LLVM作为IR。该框架的健壮性通过处理由两种不同的编译器(gcc和Microsoft Visual Studio编译器)、三种语言(C、c++和Fortran)、两种操作系统(Windows和Linux)和一个真实世界的程序(Apache服务器)生成的总计超过一百万行源代码的可执行文件得到了证明。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
EuroSys '22: Seventeenth European Conference on Computer Systems, Rennes, France, April 5 - 8, 2022 EuroSys '21: Sixteenth European Conference on Computer Systems, Online Event, United Kingdom, April 26-28, 2021 EuroSys '20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020 STRADS: a distributed framework for scheduled model parallel machine learning NChecker: saving mobile app developers from network disruptions
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1