Analysis and Modeling of Collaborative Execution Strategies for Heterogeneous CPU-FPGA Architectures

Sitao Huang, Li-Wen Chang, I. E. Hajj, Simon Garcia De Gonzalo, Juan Gómez-Luna, S. R. Chalamalasetti, Mohamed El-Hadedy, D. Milojicic, O. Mutlu, Deming Chen, Wen-mei W. Hwu
Published in: Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Publication date: 2019-04-04
DOI: 10.1145/3297663.3310305 (https://doi.org/10.1145/3297663.3310305)
Citations: 25

Abstract

Heterogeneous CPU-FPGA systems are evolving towards tighter integration between CPUs and FPGAs for improved performance and energy efficiency. At the same time, programmability is improving with High Level Synthesis tools (e.g., OpenCL Software Development Kits), which allow programmers to express their designs in high-level programming languages and avoid time-consuming and error-prone register-transfer level (RTL) programming. In the traditional loosely-coupled accelerator mode, FPGAs work as offload accelerators: an entire kernel runs on the FPGA while the CPU thread waits for the result. Tighter integration of CPUs and FPGAs, however, enables fine-grained collaborative execution, i.e., having both devices work concurrently on the same workload. Such collaborative execution makes better use of overall system resources by employing both CPU threads and FPGA concurrency, thereby achieving higher performance. In this paper, we explore the potential of collaborative execution between CPUs and FPGAs using OpenCL High Level Synthesis. First, we compare various collaborative techniques (namely, data partitioning and task partitioning) and evaluate the tradeoffs between them. We observe that choosing the most suitable partitioning strategy can improve performance by up to 2x. Second, we study the impact of a common optimization technique, kernel duplication, in a collaborative CPU-FPGA context. We show that, as a general trend, kernel duplication improves performance until the memory bandwidth saturates. Third, we provide new insights that application developers can use to choose between partitioning strategies when designing CPU-FPGA collaborative applications. We find that different partitioning strategies pose different tradeoffs (e.g., task partitioning enables more kernel duplication, while data partitioning has lower communication overhead and better load balance), but they generally outperform execution on conventional CPU-FPGA systems where no collaborative execution strategies are used. Therefore, we advocate even more integration in future heterogeneous CPU-FPGA systems (e.g., OpenCL 2.0 features, such as fine-grained shared virtual memory).
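
To make two of the abstract's claims concrete, here is a toy analytical sketch in Python. It is not the model developed in the paper: the function names, the assumption of constant per-device throughput, and all numeric rates are hypothetical and chosen only for illustration. It shows (a) data partitioning, where the CPU and the FPGA each process a share of the same workload concurrently, and (b) kernel duplication, where FPGA throughput is assumed to scale with the number of kernel copies until a memory bandwidth cap is reached.

```python
def data_partition_time(work, cpu_rate, fpga_rate, cpu_share):
    """Completion time when the CPU processes `cpu_share` of the items and
    the FPGA processes the rest, both running concurrently.

    Assumption: each device has a constant throughput in items per second.
    """
    cpu_time = cpu_share * work / cpu_rate
    fpga_time = (1.0 - cpu_share) * work / fpga_rate
    # Both devices run at the same time, so the slower one sets the total.
    return max(cpu_time, fpga_time)


def balanced_cpu_share(cpu_rate, fpga_rate):
    """Share of the work that makes both devices finish at the same time."""
    return cpu_rate / (cpu_rate + fpga_rate)


def duplicated_fpga_rate(per_copy_rate, copies, bandwidth_cap):
    """FPGA throughput with `copies` kernel instances.

    Assumption: throughput scales linearly with the number of copies
    until it reaches `bandwidth_cap`, the memory bandwidth limit.
    """
    return min(per_copy_rate * copies, bandwidth_cap)


if __name__ == "__main__":
    work = 1200          # hypothetical number of work items
    cpu_rate = 100.0     # hypothetical CPU throughput (items/s)
    per_copy = 150.0     # hypothetical throughput of one FPGA kernel copy
    cap = 500.0          # hypothetical memory bandwidth limit (items/s)

    for copies in range(1, 6):
        fpga_rate = duplicated_fpga_rate(per_copy, copies, cap)
        share = balanced_cpu_share(cpu_rate, fpga_rate)
        offload = data_partition_time(work, cpu_rate, fpga_rate, 0.0)
        collab = data_partition_time(work, cpu_rate, fpga_rate, share)
        print(f"copies={copies} fpga_rate={fpga_rate:.0f} "
              f"offload={offload:.2f}s collaborative={collab:.2f}s")
```

In this sketch, setting `cpu_share` to 0 corresponds to the traditional offload mode, and adding kernel copies stops helping once `per_copy_rate * copies` exceeds the bandwidth cap. Real systems also incur communication and synchronization costs, which this sketch deliberately ignores.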