COMP: Compiler Optimizations for Manycore Processors

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture
Pub Date: 2014-12-13
DOI: 10.1109/MICRO.2014.30

Linhai Song, Min Feng, N. Ravi, Yi Yang, S. Chakradhar

Pages: 659-671

Citations: 10

Abstract

Applications running on multicore processors can now easily offload computations to manycore processors such as Intel Xeon Phi coprocessors. However, tuning such offloaded applications for high performance requires considerable expertise and effort. Previous efforts have focused on optimizing the execution of offloaded computations on manycore processors. We observe, however, that the data transfer overhead between multicore and manycore processors, and the limited device memory of manycore processors, often constrain the performance gains that offloading can deliver. In this paper, we present three source-to-source compiler optimizations that can significantly improve the performance of applications that offload computations to manycore processors. The first optimization automatically transforms offloaded code to enable data streaming, which overlaps data transfer between multicore and manycore processors with computation on these processors to hide the transfer overhead. This optimization is also designed to minimize memory usage on manycore processors while still achieving optimal performance. The second optimization reorders computations to regularize irregular memory accesses. It enables data streaming and factorization on manycore processors even when the memory access patterns in the original source code are irregular. Finally, our new shared memory mechanism provides efficient support for transferring large pointer-based data structures between hosts and manycore processors. Our evaluation shows that the proposed compiler optimizations benefit 9 out of 12 benchmarks. Compared with simply offloading the original parallel implementations of these benchmarks, we achieve speedups of 1.16x to 52.21x.
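The data streaming idea in the abstract (overlapping transfer with computation while keeping device memory use small) can be illustrated with a minimal, hand-written sketch. This is not the paper's compiler transformation. The names `transfer`, `compute`, and `offload_streamed` are hypothetical, the "device" is simulated on the host, and the copy is modeled with a background thread so that the next block is fetched while the current one is being processed.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(block):
    """Hypothetical stand-in for a host-to-device copy (here: a list copy)."""
    return list(block)


def compute(block):
    """Hypothetical stand-in for the offloaded kernel (here: square each element)."""
    return [x * x for x in block]


def offload_streamed(data, block_size):
    """Process data block by block, copying block i+1 while block i is computed.

    At most two blocks are "on the device" at any time: the one being
    computed and the one being transferred (double buffering).
    """
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    results = []
    if not blocks:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        in_flight = copier.submit(transfer, blocks[0])
        for next_block in blocks[1:]:
            on_device = in_flight.result()                   # wait for the current block
            in_flight = copier.submit(transfer, next_block)  # start the next copy
            results.extend(compute(on_device))               # runs while the copy proceeds
        results.extend(compute(in_flight.result()))          # last block, nothing left to prefetch
    return results


if __name__ == "__main__":
    print(offload_streamed(list(range(10)), block_size=4))
```

In this sketch the block size is the knob that trades device memory for overlap: smaller blocks keep less data resident, while larger blocks amortize per-transfer cost. How the paper chooses that size is not stated in the abstract.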



