RapidStream 2.0:通过部分重构实现延迟不敏感FPGA设计的自动并行实现

IF 3.1 4区 计算机科学 Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE ACM Transactions on Reconfigurable Technology and Systems Pub Date : 2023-04-26 DOI:10.1145/3593025
Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong
{"title":"RapidStream 2.0:通过部分重构实现延迟不敏感FPGA设计的自动并行实现","authors":"Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong","doi":"10.1145/3593025","DOIUrl":null,"url":null,"abstract":"FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7 × reduction in compile time and up to 1.3 × increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on Xilinx U280 FPGA, we observed 5-7 × compile time reduction and 1.3 × frequency increase.","PeriodicalId":49248,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems","volume":" ","pages":""},"PeriodicalIF":3.1000,"publicationDate":"2023-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration\",\"authors\":\"Licheng Guo, P. Maidee, Yun Zhou, C. Lavin, Eddie Hung, Wuxi Li, Jason Lau, W. Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, A. Kaviani, Zhiru Zhang, J. Cong\",\"doi\":\"10.1145/3593025\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7 × reduction in compile time and up to 1.3 × increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on Xilinx U280 FPGA, we observed 5-7 × compile time reduction and 1.3 × frequency increase.\",\"PeriodicalId\":49248,\"journal\":{\"name\":\"ACM Transactions on Reconfigurable Technology and Systems\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2023-04-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Reconfigurable Technology and Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3593025\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3593025","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 1

摘要

fpga需要比cpu等传统计算平台更长的编译周期。在本文中,我们通过共同优化HLS编译(C-to-RTL)和后端物理实现(RTL-to-bitstream)来缩短总体编译时间。我们提出了一种基于HLS级别的流水线灵活性的分割编译方法,它允许我们为并行放置和路由划分设计。我们概述了一些技术挑战,并通过打破传统FPGA工具流不同阶段之间的传统边界并重新组织它们以实现快速的端到端编译来解决这些挑战。我们的研究产生了RapidStream,一个并行和物理集成的编译框架,它采用C/ c++中的延迟不敏感程序,并生成一个完全放置和路由的实现。我们提出了两种方法。第一种方法(RapidStream 1.0)在将不同的分区拼接在一起时解决了分区间路由冲突。当在Xilinx U250 FPGA上使用一组真实的HLS设计进行测试时,与商业现成的工具链相比,RapidStream的编译时间减少了5-7倍,频率增加了1.3倍。此外,我们提供了使用定制的开源路由器的初步结果,在性能要求较低的情况下,可以将编译时间减少到一个数量级。第二种方法(RapidStream 2.0)使用虚拟引脚防止路由冲突。在Xilinx U280 FPGA上测试,我们观察到编译时间减少了5-7倍,频率提高了1.3倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
RapidStream 2.0: Automated Parallel Implementation of Latency Insensitive FPGA Designs Through Partial Reconfiguration
FPGAs require a much longer compilation cycle than conventional computing platforms like CPUs. In this paper, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows us to partition designs for parallel placement and routing. We outline a number of technical challenges and address them by breaking the conventional boundaries between different stages of the traditional FPGA tool flow and reorganizing them to achieve a fast end-to-end compilation. Our research produces RapidStream, a parallelized and physical-integrated compilation framework that takes in a latency-insensitive program in C/C++ and generates a fully placed and routed implementation. We present two approaches. The first approach (RapidStream 1.0) resolves inter-partition routing conflicts at the end when separate partitions are stitched together. When tested on the Xilinx U250 FPGA with a set of realistic HLS designs, RapidStream achieves a 5-7 × reduction in compile time and up to 1.3 × increase in frequency when compared to a commercial off-the-shelf toolchain. In addition, we provide preliminary results using a customized open-source router to reduce the compile time up to an order of magnitude in cases with lower performance requirements. The second approach (RapidStream 2.0) prevents routing conflicts using virtual pins. Testing on Xilinx U280 FPGA, we observed 5-7 × compile time reduction and 1.3 × frequency increase.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
ACM Transactions on Reconfigurable Technology and Systems
ACM Transactions on Reconfigurable Technology and Systems COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-
CiteScore
4.90
自引率
8.70%
发文量
79
审稿时长
>12 weeks
期刊介绍: TRETS is the top journal focusing on research in, on, and with reconfigurable systems and on their underlying technology. The scope, rationale, and coverage by other journals are often limited to particular aspects of reconfigurable technology or reconfigurable systems. TRETS is a journal that covers reconfigurability in its own right. Topics that would be appropriate for TRETS would include all levels of reconfigurable system abstractions and all aspects of reconfigurable technology including platforms, programming environments and application successes that support these systems for computing or other applications. -The board and systems architectures of a reconfigurable platform. -Programming environments of reconfigurable systems, especially those designed for use with reconfigurable systems that will lead to increased programmer productivity. -Languages and compilers for reconfigurable systems. -Logic synthesis and related tools, as they relate to reconfigurable systems. -Applications on which success can be demonstrated. The underlying technology from which reconfigurable systems are developed. (Currently this technology is that of FPGAs, but research on the nature and use of follow-on technologies is appropriate for TRETS.) In considering whether a paper is suitable for TRETS, the foremost question should be whether reconfigurability has been essential to success. Topics such as architecture, programming languages, compilers, and environments, logic synthesis, and high performance applications are all suitable if the context is appropriate. For example, an architecture for an embedded application that happens to use FPGAs is not necessarily suitable for TRETS, but an architecture using FPGAs for which the reconfigurability of the FPGAs is an inherent part of the specifications (perhaps due to a need for re-use on multiple applications) would be appropriate for TRETS.
期刊最新文献
End-to-end codesign of Hessian-aware quantized neural networks for FPGAs DyRecMul: Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration Dynamic-ACTS - A Dynamic Graph Analytics Accelerator For HBM-Enabled FPGAs NC-Library: Expanding SystemC Capabilities for Nested reConfigurable Hardware Modelling PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1