Automatic Generation of Efficient Accelerators for Reconfigurable Hardware

D. Koeplinger, R. Prabhakar, Yaqi Zhang, Christina Delimitrou, C. Kozyrakis, K. Olukotun
{"title":"Automatic Generation of Efficient Accelerators for Reconfigurable Hardware","authors":"D. Koeplinger, R. Prabhakar, Yaqi Zhang, Christina Delimitrou, C. Kozyrakis, K. Olukotun","doi":"10.1145/3007787.3001150","DOIUrl":null,"url":null,"abstract":"Acceleration in the form of customized datapaths offer large performance and energy improvements over general purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6 core processor and show speedups of up to 16.7×.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"39 1","pages":"115-127"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"98","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3007787.3001150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 98

Abstract

Acceleration in the form of customized datapaths offer large performance and energy improvements over general purpose processors. Reconfigurable fabrics such as FPGAs are gaining popularity for use in implementing application-specific accelerators, thereby increasing the importance of having good high-level FPGA design tools. However, current tools for targeting FPGAs offer inadequate support for high-level programming, resource estimation, and rapid and automatic design space exploration. We describe a design framework that addresses these challenges. We introduce a new representation of hardware using parameterized templates that captures locality and parallelism information at multiple levels of nesting. This representation is designed to be automatically generated from high-level languages based on parallel patterns. We describe a hybrid area estimation technique which uses template-level models and design-level artificial neural networks to account for effects from hardware place-and-route tools, including routing overheads, register and block RAM duplication, and LUT packing. Our runtime estimation accounts for off-chip memory accesses. We use our estimation capabilities to rapidly explore a large space of designs across tile sizes, parallelization factors, and optional coarse-grained pipelining, all at multiple loop levels. We show that estimates average 4.8% error for logic resources, 6.1% error for runtimes, and are 279 to 6533 times faster than a commercial high-level synthesis tool. We compare the best-performing designs to optimized CPU code running on a server-grade 6 core processor and show speedups of up to 16.7×.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
用于可重构硬件的高效加速器的自动生成
自定义数据路径形式的加速提供了比通用处理器更大的性能和能耗改进。可重构结构(如FPGA)在实现特定应用的加速器中越来越受欢迎,从而增加了拥有良好的高级FPGA设计工具的重要性。然而,目前针对fpga的工具对高级编程、资源估计和快速自动设计空间探索的支持不足。我们描述了一个解决这些挑战的设计框架。我们引入了一种新的硬件表示,使用参数化模板,在多个嵌套级别捕获局部性和并行性信息。这种表示被设计成基于并行模式从高级语言自动生成。我们描述了一种混合面积估计技术,该技术使用模板级模型和设计级人工神经网络来考虑硬件放置和路由工具的影响,包括路由开销、寄存器和块RAM重复以及LUT打包。我们的运行时估计考虑了片外内存访问。我们使用我们的估计能力来快速探索跨越瓷砖大小、并行化因素和可选的粗粒度流水线的大型设计空间,所有这些都在多个循环级别上。我们表明,对逻辑资源的估计平均误差为4.8%,对运行时的估计平均误差为6.1%,并且比商业高级综合工具快279到6533倍。我们将性能最好的设计与在服务器级6核处理器上运行的优化CPU代码进行比较,并显示速度高达16.7倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
RelaxFault Memory Repair Boosting Access Parallelism to PCM-Based Main Memory Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems Energy Efficient Architecture for Graph Analytics Accelerators
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1