Exploiting Vector Processing in Dynamic Binary Translation

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI:10.1145/3337821.3337844

Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu

{"title":"Exploiting Vector Processing in Dynamic Binary Translation","authors":"Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu","doi":"10.1145/3337821.3337844","DOIUrl":null,"url":null,"abstract":"Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"106 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337844","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

自动向量化技术已经被编译器采用了几十年，以利用并行处理中的数据级并行性。然而，由于处理器体系结构不断增强新特性以提高矢量/SIMD性能，遗留应用程序二进制文件无法在现代体系结构中充分利用新的矢量/SIMD功能。例如，遗留的ARMv7二进制文件不能受益于ARMv8 SIMD双精度功能，遗留的x86二进制文件不能享受AVX-512扩展的强大功能。在本文中，我们研究了跨isa动态二进制转换(DBT)所涉及的基本问题，将非矢量化循环转换为矢量/SIMD形式，以在较新的处理器架构中实现更高的计算吞吐量。其关键思想是从这些应用程序二进制文件中恢复关键循环信息，以便在运行时执行向量化。实验结果表明，在ARMv7到armv8动态二进制转换系统中，与ARMv7本机运行的各种基准测试相比，我们的方法实现了1.42倍的平均加速。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Proceedings of the 48th International Conference on Parallel Processing

自引率

0.00%

发文量

期刊最新文献

Express Link Placement for NoC-Based Many-Core Platforms Cartesian Collective Communication Artemis A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs diBELLA: Distributed Long Read to Long Read Alignment