A System-Level Dynamic Binary Translator Using Automatically-Learned Translation Rules

2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Pub Date: 2024-02-15. DOI: 10.1109/CGO57630.2024.10444850

Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, P. Yew, Weihua Zhang

Pages: 423-434


Abstract

System-level emulators are used extensively for the design, debugging, and evaluation of system software. They provide a system-level virtual machine that supports a guest operating system (OS) running on a platform whose native OS and instruction-set architecture may be the same as or different from the guest's. Dynamic binary translation (DBT) is one of the core technologies for such system-level emulation. A recently proposed learning-based approach that uses automatically learned translation rules has been shown to improve DBT performance significantly by producing much higher-quality translated code. However, it has only been applied to user-level emulation, not system-level emulation. When we apply this approach directly to QEMU for system-level emulation, we find that it actually causes an unexpected performance degradation of 5% on average. Analyzing the main causes in more detail, we find that the learning-based approach by default uses host registers to maintain the guest CPU state, including the condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (and QEMU also needs to use the host registers), maintaining system state in the host registers for the guest, the host, and QEMU during and between context switches can cause undue overhead if not handled carefully. Such cases include emulating system-level instructions, address translation, and interrupts, all of which require QEMU's helper functions.

To achieve the intended performance improvement from the better-quality code generated by the learning-based approach, we propose several optimization techniques: reducing the overhead incurred in each context switch, reducing the number of context switches needed, and scheduling code better to eliminate context switches. Our experimental results show that these optimizations achieve an average speedup of 1.36X over QEMU 6.1 on SPEC CINT2006 and 1.15X on real-world applications in system emulation mode.
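
To make the context-switch cost described in the abstract concrete, the following is a minimal toy model, not the paper's implementation. It assumes guest registers (including the flags) are cached in host registers and must be written back to an in-memory CPU state before an emulator helper runs, then reloaded afterwards. The names `Translator`, `call_helper`, and `tlb_miss_helper`, the register list, and the "dirty-only" synchronization policy are all hypothetical choices for illustration; the abstract does not specify how the authors reduce per-switch overhead.

```python
from dataclasses import dataclass, field

# Hypothetical guest register file; "flags" stands in for the condition codes.
GUEST_REGS = ["r0", "r1", "r2", "r3", "flags"]


def _zeroed():
    return {reg: 0 for reg in GUEST_REGS}


@dataclass
class Translator:
    """Toy model of translated code that keeps guest state in host registers."""
    lazy: bool                                    # True: sync only what changed
    cached: dict = field(default_factory=_zeroed)  # guest state in host registers
    env: dict = field(default_factory=_zeroed)     # in-memory state helpers see
    dirty: set = field(default_factory=set)
    memory_ops: int = 0                            # loads/stores spent on switches

    def write(self, reg, value):
        """Translated code updates a guest register held in a host register."""
        self.cached[reg] = value
        self.dirty.add(reg)

    def call_helper(self, helper):
        """Context switch to a helper that operates on the in-memory state."""
        to_save = set(self.dirty) if self.lazy else set(GUEST_REGS)
        for reg in to_save:
            self.env[reg] = self.cached[reg]
        self.memory_ops += len(to_save)
        self.dirty.clear()

        modified = helper(self.env)  # helper reports which registers it wrote

        to_reload = set(modified) if self.lazy else set(GUEST_REGS)
        for reg in to_reload:
            self.cached[reg] = self.env[reg]
        self.memory_ops += len(to_reload)


def tlb_miss_helper(env):
    """Stand-in for an address-translation helper; it only touches r0."""
    env["r0"] += 1
    return {"r0"}


def run(lazy, iterations=1000):
    t = Translator(lazy=lazy)
    for i in range(iterations):
        t.write("r1", i)
        t.write("flags", i & 1)
        t.call_helper(tlb_miss_helper)
    return t


eager_run = run(lazy=False)
lazy_run = run(lazy=True)
assert eager_run.cached == lazy_run.cached  # same guest-visible result
print(eager_run.memory_ops, lazy_run.memory_ops)  # prints "10000 3000"
```

In this sketch both policies leave the guest in the same state, but the eager policy spends 10 memory operations per helper call while the dirty-only policy spends 3. This is the kind of per-switch saving the abstract alludes to; the real system must also account for registers that QEMU itself clobbers.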
