
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

Secure Function Evaluation Using an FPGA Overlay Architecture
Xin Fang, Stratis Ioannidis, M. Leeser
Secure Function Evaluation (SFE) has received considerable attention recently due to the massive collection and mining of personal data over the Internet, but large computational costs still render it impractical. In this paper, we leverage hardware acceleration to tackle the scalability and efficiency challenges inherent in SFE. To that end, we propose a generic, reconfigurable implementation of SFE as a coarse-grained FPGA overlay architecture. Contrary to tailored approaches that are tied to the execution of a specific SFE structure, and require full reprogramming of an FPGA with each new execution, our design allows repurposing an FPGA to evaluate different SFE tasks without the need for reprogramming. Our implementation shows orders of magnitude improvement over a software package for evaluating garbled circuits, and demonstrates that the circuit being evaluated can change with almost no overhead.
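The primitive such an overlay evaluates is the garbled gate. As a loose software-level illustration of the classical idea (not the paper's hardware design; the simplified row-scanning evaluation and all names are illustrative — real schemes use point-and-permute bits to select the row directly), a toy garbled AND gate can be built and evaluated like this:

```python
import hashlib
import os

def _h(a: bytes, b: bytes) -> bytes:
    # Hash a pair of wire labels into a one-time pad for one table row.
    return hashlib.sha256(a + b).digest()

def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))

def garble_and_gate():
    """Garble one AND gate: two random 32-byte labels per wire (one for bit 0,
    one for bit 1), plus four table rows, each encrypting the correct output
    label under the hash of the matching input-label pair."""
    labels = {w: (os.urandom(32), os.urandom(32)) for w in "abc"}
    table = [_xor(_h(labels["a"][va], labels["b"][vb]), labels["c"][va & vb])
             for va in (0, 1) for vb in (0, 1)]
    return labels, table

def evaluate_gate(table, label_a, label_b):
    """The evaluator holds exactly one label per input wire and tries every
    row; only the row matching its label pair decrypts to a valid output
    label, so the evaluator learns the output without learning the inputs."""
    return [_xor(_h(label_a, label_b), row) for row in table]
```

Repurposing an overlay then amounts to loading a new garbled table rather than reprogramming the fabric, which is what makes the near-zero circuit-switching overhead plausible.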
DOI: 10.1145/3020078.3021746 (published 2017-02-22)
Citations: 22
Automatic Construction of Program-Optimized FPGA Memory Networks
Hsin-Jung Yang, Kermin Fleming, F. Winterstein, Annie I. Chen, Michael Adler, J. Emer
Memory systems play a key role in the performance of FPGA applications. As FPGA deployments move towards design entry points that are more serial, memory latency has become a serious design consideration. For these applications, memory network optimization is essential in improving performance. In this paper, we examine the automatic, program-optimized construction of low-latency memory networks. We design a feedback-driven network compiler, which constructs an optimized memory network based on the target program's memory access behavior measured via a newly designed network profiler. In our test applications, the compiler-optimized networks provide a 45% performance gain on average over baseline memory networks by minimizing the impact of network latency on program performance.
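As a rough illustration of the profile-driven idea (the Huffman-style heuristic and all function names are assumptions for illustration, not the paper's compiler), one way measured access counts could shape a latency-optimized tree network is to merge the coldest endpoints first, so hot endpoints end up fewer hops from the memory controller:

```python
import heapq

def build_memory_tree(access_counts):
    """Build a binary interconnect tree from profiled per-endpoint access
    counts, merging the two coldest subtrees first (Huffman-style) so that
    frequently accessed endpoints sit closest to the root."""
    heap = [(count, i, name) for i, (name, count) in enumerate(access_counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)  # tie-breaker so tuples stay comparable
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (left, right)))
        next_id += 1
    return heap[0][2]

def hop_counts(tree, depth=0):
    """Map each endpoint name to its distance (in hops) from the tree root."""
    if isinstance(tree, str):
        return {tree: depth}
    result = {}
    for child in tree:
        result.update(hop_counts(child, depth + 1))
    return result
```

Minimizing traffic-weighted depth in this way is one concrete sense in which a network can be "program-optimized": the topology follows the measured access distribution rather than a fixed balanced tree.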
DOI: 10.1145/3020078.3021748 (published 2017-02-22)
Citations: 5
NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element
Zhihong Huang, Xing Wei, Grace Zgheib, Wei Li, Y. Lin, Zhenghong Jiang, Kaihui Tu, P. Ienne, Haigang Yang
The And-Inverter Cone (AIC) has been introduced as an alternative logic element to the look-up table in FPGAs, since it improves their performance and resource utilization. However, further analysis of the AIC design showed that it suffers from a delay discrepancy problem. Furthermore, the existing AIC cluster design is not properly optimized and contains unnecessary logic that impedes its performance. Thus, we propose in this work a more efficient logic element called NAND-NOR, together with delay-balanced dual-phase multiplexers for the input crossbar. Our simulations show that the NAND-NOR substantially reduces the delay discrepancy, with a 14% to 46% delay improvement compared to AICs. Along with the other modifications, it reduces the total cluster area by about 27% compared to the reference AIC cluster. Testing the new architecture on a large set of benchmarks shows an improvement of the delay-area product by about 44% and 21% for the MCNC and VTR benchmarks, respectively, when compared to LUT-based clusters. This improvement reaches 31% and 19%, respectively, when compared to the AIC-based architecture.
DOI: 10.1145/3020078.3021750 (published 2017-02-22)
Citations: 7
Measuring the Power-Constrained Performance and Energy Gap between FPGAs and Processors (Abstract Only)
A. Ye, K. Ganesan
This work measures the performance and power consumption gap between the current generation of low-power FPGAs and low-power microprocessors (microcontrollers) through an implementation of the Canny edge detection algorithm. In particular, the algorithm is implemented on Altera MAX 10 FPGAs, and its performance and power consumption are then compared to the same algorithm implemented on STMicroelectronics' ARM M-series microcontrollers. We found an extremely large performance advantage, four to five orders of magnitude, for the FPGAs over the microcontrollers, which is much greater than any previously reported values in FPGA-versus-processor studies. Furthermore, this speedup comes at a cost of only 1.2x to 15x higher power consumption, which gives FPGAs a significant advantage in energy efficiency. We also observe, however, that the current generation of low-power FPGAs has significantly higher static power consumption than the microcontrollers. In particular, the low-power FPGAs consume more static power than the total power consumption of the lowest-power microcontrollers, rendering the FPGAs inoperable under the power budgets of these processors. Furthermore, this high static power consumption exists despite the fact that the FPGAs are implemented on a low-leakage 55nm process with dual supply voltages, while the microcontrollers are implemented on a conventional, single-supply-voltage 90nm process. Consequently, our results indicate that it is particularly important for future research to address the static power consumption of low-power FPGAs while maintaining logic capacity, so that the performance and energy-efficiency advantages of FPGAs can be fully utilized in the extremely low-power application domains driven by batteries with very small form factors and by emerging small-scale energy harvesting technologies.
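The energy-efficiency conclusion follows from a simple relation: for a fixed workload, the energy ratio is the power ratio divided by the speedup. A minimal sketch of that arithmetic (the function name is illustrative, not from the paper):

```python
def energy_ratio(speedup, power_ratio):
    """Energy(FPGA) / Energy(MCU) for a fixed workload: the FPGA draws
    power_ratio times the power but finishes speedup times sooner, so it
    wins on energy whenever speedup exceeds power_ratio."""
    return power_ratio / speedup
```

With the abstract's figures, even the worst case (15x power, 10^4 speedup) leaves the FPGA ahead on dynamic energy by roughly three orders of magnitude; the catch the authors identify is static power, which accrues regardless of speedup.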
DOI: 10.1145/3020078.3021756 (published 2017-02-22)
Citations: 0
Cache Timing Attacks from The SoCFPGA Coherency Port (Abstract Only)
S. Chaudhuri
In this presentation, we show that side channels arising from the micro-architecture of SoCFPGAs can be a security risk. We present an FPGA trojan based on OpenCL which performs cache-timing attacks through the accelerator coherency port (ACP) of a SoCFPGA. Its primary goal is to derive the physical addresses used by the Linux kernel on the ARM Hard Processor System. With this information, the trojan can then surgically change memory locations to gain privileges, as in a rootkit. We present the customisation of the Altera OpenCL platform and the OpenCL code implementing the trojan. We show that it is possible to accurately predict physical addresses and the page table entries corresponding to an arbitrary location in the heap after sufficiently many (~300) iterations, using a differential ranking. The attack can be refined using the known page table structure of the Linux kernel to accurately determine the target physical address and its corresponding page table entry. Malicious code can then be injected from the FPGA by redirecting page table entries. Since Linux kernel version 4.0-rc5, physical addresses have been obfuscated from normal users to prevent Rowhammer attacks; with information from the ACP side channel, this measure can be bypassed.
DOI: 10.1145/3020078.3021802 (published 2017-02-22)
Citations: 2
Packet Matching on FPGAs Using HMC Memory: Towards One Million Rules
Daniel Rozhko, Geoffrey Elliott, D. Ly-Ma, P. Chow, H. Jacobsen
Packet processing systems increasingly need larger rulesets to satisfy the needs of deep-network intrusion prevention and cluster computing. FPGA-based implementations of packet processing systems have been proposed but their use of on-chip memory limits the number of rules these existing systems can maintain. Off-chip memories have traditionally been too slow to enable meaningful processing rates, but in this work we present a packet processing system that utilizes the much faster Hybrid Memory Cube (HMC) technology, enabling larger rulesets at usable line-rates. The proposed architecture streams rules from the HMC memory to a packet matching engine, using prefetching to hide the HMC access latency. The packet matching engine is replicated to process multiple packets in parallel. The final system, implemented on a Xilinx Kintex Ultrascale 060, processes 160 packets in parallel, achieving a 10 Gbps line-rate with approximately 1500 rules and a 16 Mbps line-rate with 1M rules. To the best of our knowledge, this is the first hardware solution capable of maintaining rulesets of this size. We present this work as an exploration of the application of HMCs to packet processing and as a first step in achieving a processing capability of a million rules at usable line-rates.
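A software sketch of the rule-streaming structure (the exact-match rule format, burst size, and names are illustrative assumptions, not the paper's engine): rules arrive in fixed-size bursts standing in for prefetched HMC reads, and every in-flight packet is checked against each burst, mirroring the replicated matching engines.

```python
from itertools import islice

def match_packets(packets, rules, burst_size=4):
    """Stream the ruleset in fixed-size bursts (a stand-in for prefetched HMC
    reads) and test all in-flight packets against each burst. Rules are
    (rule_id, field, value) exact-match triples; returns the first matching
    rule id per packet, or None if nothing matched."""
    results = [None] * len(packets)
    rule_stream = iter(rules)
    while True:
        burst = list(islice(rule_stream, burst_size))
        if not burst:
            break  # the whole ruleset has streamed past
        for i, packet in enumerate(packets):
            if results[i] is not None:
                continue  # this engine already found its match
            for rule_id, field, value in burst:
                if packet.get(field) == value:
                    results[i] = rule_id
                    break
    return results
```

The key property this mirrors is that ruleset size costs streaming time, not on-chip storage, which is why the line-rate degrades gracefully (10 Gbps at ~1500 rules down to 16 Mbps at 1M rules) instead of the ruleset hitting a hard capacity wall.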
DOI: 10.1145/3020078.3021752 (published 2017-02-22)
Citations: 8
A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based Technology Mapping
Gai Liu, Zhiru Zhang
Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to say, such miscorrelations eventually result in suboptimal quality of results. In this paper we propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework to minimize the circuit area for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and makes use of the mapping result to determine the likelihood of accepting the proposed transformation. To mitigate the runtime overhead, we further introduce parallelization techniques to decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that our approach achieves promising area improvement over a set of commonly used benchmarks. Notably, PIMap reduces the LUT usage by up to 14% and by 7% on average over the best-known records for the EPFL arithmetic benchmark suite.
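The propose-map-accept loop can be sketched in a few lines. This greedy toy, with stand-in transformations and a trivial "mapper", only illustrates the structure; the real PIMap uses a likelihood-based acceptance test and an actual technology mapper, and all names here are illustrative:

```python
import random

def iterative_map(netlist, transforms, map_area, iters=200, seed=1):
    """Iterative-improvement loop in the spirit of PIMap: randomly propose one
    logic transformation, re-run the (stand-in) technology mapper, and keep
    the proposal only if the mapped area does not get worse."""
    rng = random.Random(seed)
    best, best_area = netlist, map_area(netlist)
    for _ in range(iters):
        candidate = rng.choice(transforms)(best)
        area = map_area(candidate)
        if area <= best_area:  # greedy stand-in for the likelihood test
            best, best_area = candidate, area
    return best, best_area

# Toy stand-ins: a "netlist" is a list of gate ids, "mapping" just counts them.
dedupe = lambda n: sorted(set(n))   # a transformation that helps
pad = lambda n: n + n[:1]           # a transformation that hurts
```

The point of driving acceptance with the post-mapping metric, rather than a pre-mapping proxy, is exactly the miscorrelation gap described above: a transformation is judged by what the mapper actually produces.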
DOI: 10.1145/3020078.3021735 (published 2017-02-22)
Citations: 27
Quality-Time Tradeoffs in Component-Specific Mapping: How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays
Hans Giesen, Raphael Rubin, Benjamin Gojman, A. DeHon
How should we perform component-specific adaptation for FPGAs? Prior work has demonstrated that the negative effects of variation can be largely mitigated using complete knowledge of device characteristics and a full per-FPGA CAD flow. However, the cost of per-FPGA characterization and mapping could be prohibitively expensive. We explore lightweight options for per-FPGA mapping that avoid the need for a priori device characterization and perform less expensive per-FPGA customization work. We characterize the tradeoff between Quality-of-Results (energy, delay) and per-device mapping costs for 7 design points, ranging from complete mapping based on full knowledge to no per-device mapping. We show that it is possible to get 48-77% of the component-specific mapping delay benefit, or 57% of the energy benefit, with a mapping that takes less than 20 seconds per FPGA. An incremental solution can start execution after a 21 ms bitstream load and converge to 77% of the delay benefit after 18 seconds of runtime.
DOI: 10.1145/3020078.3026124 (published 2017-02-22)
Citations: 4
Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)
Zhipeng Zhao, J. Hoe
There have been ample successful examples of applying Xilinx Vivado's "function-to-module" high-level synthesis (HLS) where the subject is algorithmic in nature. In this work, we carried out a design study to assess the effectiveness of applying Vivado-HLS in structural design. We employed Vivado-HLS to synthesize C functions corresponding to standalone network-on-chip (NoC) routers as well as complete multi-endpoint NoCs. Interestingly, we find that describing a complete NoC comprising router submodules faces fundamental difficulties not present in describing the routers as standalone modules. Ultimately, we succeeded in using Vivado-HLS to produce router and NoC modules that are exact cycle- and bit-accurate replacements of our reference RTL-based router and NoC modules. Furthermore, the routers and NoCs resulting from HLS and RTL are comparable in resource utilization and critical path delay. Our experience subjectively suggests that HLS is able to simplify the design effort even though much of the structural details had to be provided in the HLS description through a combination of coding discipline and explicit pragmas. The C++ source code and a more extensive description of this work can be found at http://www.ece.cmu.edu/calcm/connect_hls.
{"title":"Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)","authors":"Zhipeng Zhao, J. Hoe","doi":"10.1145/3020078.3021772","DOIUrl":"https://doi.org/10.1145/3020078.3021772","url":null,"abstract":"There have been ample successful examples of applying Xilinx Vivado's \"function-to-module\" high-level synthesis (HLS) where the subject is algorithmic in nature. In this work, we carried out a design study to assess the effectiveness of applying Vivado-HLS in structural design. We employed Vivado-HLS to synthesize C functions corresponding to standalone network-on-chip (NoC) routers as well as complete multi-endpoint NoCs. Interestingly, we find that describing a complete NoC comprising router submodules faces fundamental difficulties not present in describing the routers as standalone modules. Ultimately, we succeeded in using Vivado-HLS to produce router and NoC modules that are exact cycle- and bit-accurate replacements of our reference RTL-based router and NoC modules. Furthermore, the routers and NoCs resulting from HLS and RTL are comparable in resource utilization and critical path delay. Our experience subjectively suggests that HLS is able to simplify the design effort even though much of the structural details had to be provided in the HLS description through a combination of coding discipline and explicit pragmas. The C++ source code and a more extensive description of this work can be found at http://www.ece.cmu.edu/calcm/connect_hls.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133555116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
A Parallel Bandit-Based Approach for Autotuning FPGA Compilation
Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, Zhiru Zhang
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.
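The core MAB loop the abstract describes can be sketched as a UCB1 bandit choosing among candidate tool-option configurations ("arms"). The code below is an illustrative stand-in, not the paper's implementation: in practice the reward for an arm would come from running the CAD flow with that configuration and measuring quality of results, whereas here a fixed reward table plays that role, and all names are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Minimal UCB1 multi-armed bandit. Each arm stands for one tool-option
// configuration; reward stands for measured QoR of a compile run.
struct Bandit {
    std::vector<int>    pulls;        // times each arm was tried
    std::vector<double> mean_reward;  // running mean reward per arm
    int total = 0;                    // total pulls so far

    explicit Bandit(int arms) : pulls(arms, 0), mean_reward(arms, 0.0) {}

    // Pick the arm maximizing mean reward plus an exploration bonus.
    int select() const {
        int best = 0;
        double best_score = -1.0;
        for (std::size_t a = 0; a < pulls.size(); ++a) {
            double score = (pulls[a] == 0)
                ? 1e9  // force every arm to be tried at least once
                : mean_reward[a] +
                  std::sqrt(2.0 * std::log((double)total) / pulls[a]);
            if (score > best_score) { best_score = score; best = (int)a; }
        }
        return best;
    }

    // Fold one observed reward into the chosen arm's running mean.
    void update(int arm, double reward) {
        ++pulls[arm];
        ++total;
        mean_reward[arm] += (reward - mean_reward[arm]) / pulls[arm];
    }
};
```

The paper's parallel scheme goes further: it runs multiple such autotuners concurrently and dynamically partitions the option space among them based on observed search quality, which this single-bandit sketch does not show.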
{"title":"A Parallel Bandit-Based Approach for Autotuning FPGA Compilation","authors":"Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, Zhiru Zhang","doi":"10.1145/3020078.3021747","DOIUrl":"https://doi.org/10.1145/3020078.3021747","url":null,"abstract":"Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131040699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 41