Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures

Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering; Pub Date: 2019-04-04; DOI: 10.1145/3297663.3310305

Sitao Huang, Li-Wen Chang, I. E. Hajj, Simon Garcia De Gonzalo, Juan Gómez-Luna, S. R. Chalamalasetti, Mohamed El-Hadedy, D. Milojicic, O. Mutlu, Deming Chen, Wen-mei W. Hwu

Citations: 25

Abstract

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is also improving with high-level synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs in high-level programming languages and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely coupled accelerator mode, FPGAs work as offload accelerators, where an entire kernel runs on the FPGA while the CPU thread waits for the result. However, tighter integration of the CPUs and the FPGAs enables fine-grained collaborative execution, i.e., having both devices work concurrently on the same workload. Such collaborative execution makes better use of the overall system resources by employing both CPU threads and FPGA concurrency, thereby achieving higher performance. In this paper, we explore the potential of collaborative execution between CPUs and FPGAs using OpenCL high-level synthesis. First, we compare various collaborative techniques (namely, data partitioning and task partitioning), and evaluate the tradeoffs between them. We observe that choosing the most suitable partitioning strategy can improve performance by up to 2x. Second, we study the impact of a common optimization technique, kernel duplication, in a collaborative CPU-FPGA context. We show that the general trend is that kernel duplication improves performance until the memory bandwidth saturates. Third, we provide new insights that application developers can use when designing CPU-FPGA collaborative applications to choose between different partitioning strategies. We find that different partitioning strategies pose different tradeoffs (e.g., task partitioning enables more kernel duplication, while data partitioning has lower communication overhead and better load balance), but they generally outperform execution on conventional CPU-FPGA systems where no collaborative execution strategies are used. Therefore, we advocate even more integration in future heterogeneous CPU-FPGA systems (e.g., OpenCL 2.0 features, such as fine-grained shared virtual memory).
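The two effects the abstract describes, splitting one workload between concurrently running devices and duplicating kernels until memory bandwidth saturates, can be illustrated with a toy model. The sketch below is not the model from the paper: the function names, the linear throughput assumption, and all numbers are hypothetical, and transfer and synchronization overheads are ignored.

```python
"""Toy model of CPU-FPGA collaborative execution (illustrative assumptions only)."""


def data_partition_time(work, cpu_rate, fpga_rate, cpu_fraction):
    """Execution time when the CPU takes `cpu_fraction` of the work.

    Both devices run concurrently, so the slower side sets the total time.
    Rates are in work items per second (hypothetical units).
    """
    cpu_time = work * cpu_fraction / cpu_rate
    fpga_time = work * (1.0 - cpu_fraction) / fpga_rate
    return max(cpu_time, fpga_time)


def balanced_cpu_fraction(cpu_rate, fpga_rate):
    """Split at which both devices finish together (no transfer overhead)."""
    return cpu_rate / (cpu_rate + fpga_rate)


def duplicated_kernel_throughput(copies, per_copy_rate, memory_bandwidth):
    """Aggregate throughput of `copies` kernel instances.

    Throughput scales linearly until it reaches the memory bandwidth cap.
    """
    return min(copies * per_copy_rate, memory_bandwidth)


if __name__ == "__main__":
    work, cpu_rate, fpga_rate = 1200.0, 100.0, 300.0  # made-up values
    frac = balanced_cpu_fraction(cpu_rate, fpga_rate)
    offload_only = data_partition_time(work, cpu_rate, fpga_rate, 0.0)
    collaborative = data_partition_time(work, cpu_rate, fpga_rate, frac)
    print(f"CPU share: {frac:.2f}")
    print(f"FPGA-only time: {offload_only:.1f}, collaborative time: {collaborative:.1f}")
    for copies in range(1, 5):
        rate = duplicated_kernel_throughput(copies, 4.0, 10.0)
        print(f"{copies} kernel copies -> throughput {rate:.1f}")
```

In this sketch the balanced split gives the CPU a share proportional to its throughput, and adding kernel copies stops helping once the aggregate rate reaches the bandwidth cap. A real system would also have to account for communication cost and load imbalance, which the abstract names as the main differences between data and task partitioning.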

Latest papers from this venue

Performance Evaluation of Multi-Path TCP for Data Center and Cloud Workloads

Cachematic - Automatic Invalidation in Application-Level Caching Systems

Memory Centric Characterization and Analysis of SPEC CPU2017 Suite

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects

Yardstick: A Benchmark for Minecraft-like Services