
Latest publications: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Enhancing Butterfly Fat Tree NoCs for FPGAs with Lightweight Flow Control
G. Malik, Nachiket Kapre
FPGA overlay networks-on-chip (NoCs) based on Butterfly Fat Tree (BFT) topology and lightweight flow control can outperform state-of-the-art FPGA NoCs, such as Hoplite and others, on metrics such as throughput, latency, cost and power efficiency, and features such as in-order delivery and bounded packet delivery times. On one hand, lightweight FPGA NoCs built on the principle of bufferless deflection routing, such as Hoplite, can deliver low-LUT-cost implementations but sacrifice crucial features such as in-order delivery, livelock freedom, and bounds on delivery times. On the other hand, capable conventional NoCs like CONNECT provide these features but are significantly more expensive in LUT cost. Butterfly Fat Trees with lightweight flow control can deliver these features at medium cost while providing bandwidth configuration flexibility to the developer. We design FPGA-friendly routers with (1) latency-insensitive interfaces, coupled with (2) deterministic routing policy, and (3) round-robin scheduling at NoC ports to develop switches that take 311-375 LUTs/router. We evaluate our NoC under various conditions including synthetic and real-world workloads to deliver resource-proportional throughput and latency wins over competing NoCs, while significantly improving dynamic power consumption when compared to deflection-routed NoCs. We also explore the bandwidth customizability of the BFT organization to identify best NoC configurations for resource-constrained and application-requirement constrained scenarios.
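The round-robin scheduling at NoC ports mentioned in the abstract can be illustrated with a minimal software sketch (a hypothetical model for intuition only, not the authors' router RTL):

```python
class RoundRobinArbiter:
    """Minimal round-robin arbiter sketch for one router output port.

    Each cycle, grant the first requesting input at or after a rotating
    pointer, then advance the pointer past the winner so no input can
    starve -- the fairness property round-robin port scheduling provides.
    """

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.pointer = 0  # input that currently has highest priority

    def grant(self, requests):
        """requests: one bool per input. Returns winning index, or None."""
        for offset in range(self.num_inputs):
            idx = (self.pointer + offset) % self.num_inputs
            if requests[idx]:
                self.pointer = (idx + 1) % self.num_inputs  # rotate priority
                return idx
        return None

arb = RoundRobinArbiter(4)
print(arb.grant([True, False, True, False]))  # grants input 0
print(arb.grant([True, False, True, False]))  # grants input 2 (pointer rotated)
```

In hardware this priority rotation is typically a small combinational circuit, which is one reason round-robin arbitration fits in a low-LUT-cost router.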
DOI: 10.1145/3289602.3294002 (published 2019-02-20)
Citations: 5
Generic Connectivity-Based CGRA Mapping via Integer Linear Programming
Matthew James Peter Walker, J. Anderson
Coarse-grained reconfigurable architectures (CGRAs) are programmable logic devices with large coarse-grained ALU-like logic blocks, and multi-bit datapath-style routing. CGRAs often have relatively restricted data routing networks, so they attract CAD mapping tools that use exact methods, such as Integer Linear Programming (ILP). However, tools that target general architectures must use large constraint systems to fully describe an architecture's flexibility, resulting in lengthy run-times. In this paper, we propose to derive connectivity information from an otherwise generic device model, and use this to create simpler ILPs, which we combine in an iterative schedule and retain most of the exactness of a fully-generic ILP approach. This new approach has a speed-up geometric mean of 5.88x when considering benchmarks that do not hit a time-limit of 7.5 hours on the fully-generic ILP, and 37.6x otherwise. This was measured using the set of benchmarks used to originally evaluate the fully-generic approach and several more benchmarks representing computation tasks, over three different CGRA architectures. All run-times of the new approach are less than 20 minutes, with 90th percentile time of 410 seconds. The proposed mapping techniques are integrated into, and evaluated using the open-source CGRA-ME architecture modelling and exploration framework.
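The core constraint the paper's ILPs encode is connectivity-aware placement: every dataflow edge must land on a routable CGRA connection. A toy exhaustive-search sketch over a hypothetical 2x2 CGRA (illustrative only; the paper uses an ILP solver, not brute force) makes that constraint concrete:

```python
from itertools import permutations

# Hypothetical 2x2 mesh CGRA: blocks 0..3, bidirectional nearest-neighbour links.
cgra_edges = {(0, 1), (1, 0), (2, 3), (3, 2), (0, 2), (2, 0), (1, 3), (3, 1)}

# Small dataflow graph to map: operation "a" feeds both "b" and "c".
dfg_ops = ["a", "b", "c"]
dfg_edges = [("a", "b"), ("a", "c")]

def find_mapping(ops, edges, blocks, conn):
    """Return the first op->block placement in which every dataflow edge
    is covered by a direct CGRA connection, or None if none exists."""
    for placement in permutations(blocks, len(ops)):
        m = dict(zip(ops, placement))
        if all((m[src], m[dst]) in conn for src, dst in edges):
            return m
    return None

mapping = find_mapping(dfg_ops, dfg_edges, range(4), cgra_edges)
print(mapping)  # e.g. {'a': 0, 'b': 1, 'c': 2}
```

Exhaustive search explodes factorially with design size, which is exactly why exact mapping is posed as an ILP; the paper's contribution is shrinking those ILPs using connectivity derived from the device model.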
DOI: 10.1109/FCCM.2019.00019 (published 2019-01-30)
Citations: 26
AutoPhase: Compiler Phase-Ordering for HLS with Deep Reinforcement Learning
Qijing Huang, Ameer Haj-Ali, William S. Moses, J. Xiang, I. Stoica, K. Asanović, J. Wawrzynek
The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Choosing a good order–often referred to as the phase-ordering problem–is an NP-hard problem. In this paper, we evaluate a new technique to address the phase-ordering problem: deep reinforcement learning. We implement a framework in the context of the LLVM compiler to optimize the ordering for HLS programs and compare the performance of deep reinforcement learning to state-of-the-art algorithms that address the phase-ordering problem. Overall, our framework runs one to two orders of magnitude faster than these algorithms, and achieves a 16% improvement in circuit performance over the -O3 compiler flag.
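Why pass ordering matters can be shown with a tiny sketch: some optimization passes only pay off when run after others. The cost model below is entirely hypothetical, and exhaustive search stands in for the paper's deep-RL agent purely to make the search space visible:

```python
from itertools import permutations

PASSES = ("inline", "unroll", "cse", "licm")

def cost(order):
    """Hypothetical cost model with order-dependent pass interactions,
    mimicking real compilers (e.g. inlining can expose redundancy that
    common-subexpression elimination then removes)."""
    c = 100
    if order.index("inline") < order.index("cse"):
        c -= 20  # inlining before CSE exposes cross-call redundancy
    if order.index("unroll") < order.index("licm"):
        c -= 10  # unrolling before LICM creates hoistable invariant code
    return c

# Exhaustive search over all 4! = 24 orderings; for realistic pass counts
# this is intractable, which is why learned search is attractive.
best = min(permutations(PASSES), key=cost)
print(best, cost(best))
```

With dozens of passes the factorial space cannot be enumerated, so the paper instead trains a reinforcement-learning agent to pick the next pass given program features.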
DOI: 10.1109/FCCM.2019.00049 (published 2019-01-15)
Citations: 30