Secure Function Evaluation (SFE) has received considerable attention recently due to the massive collection and mining of personal data over the Internet, but large computational costs still render it impractical. In this paper, we leverage hardware acceleration to tackle the scalability and efficiency challenges inherent in SFE. To that end, we propose a generic, reconfigurable implementation of SFE as a coarse-grained FPGA overlay architecture. Contrary to tailored approaches that are tied to the execution of a specific SFE structure, and require full reprogramming of an FPGA with each new execution, our design allows repurposing an FPGA to evaluate different SFE tasks without the need for reprogramming. Our implementation shows orders of magnitude improvement over a software package for evaluating garbled circuits, and demonstrates that the circuit being evaluated can change with almost no overhead.
{"title":"Secure Function Evaluation Using an FPGA Overlay Architecture","authors":"Xin Fang, Stratis Ioannidis, M. Leeser","doi":"10.1145/3020078.3021746","DOIUrl":"https://doi.org/10.1145/3020078.3021746","url":null,"abstract":"Secure Function Evaluation (SFE) has received considerable attention recently due to the massive collection and mining of personal data over the Internet, but large computational costs still render it impractical. In this paper, we leverage hardware acceleration to tackle the scalability and efficiency challenges inherent in SFE. To that end, we propose a generic, reconfigurable implementation of SFE as a coarse-grained FPGA overlay architecture. Contrary to tailored approaches that are tied to the execution of a specific SFE structure, and require full reprogramming of an FPGA with each new execution, our design allows repurposing an FPGA to evaluate different SFE tasks without the need for reprogramming. Our implementation shows orders of magnitude improvement over a software package for evaluating garbled circuits, and demonstrates that the circuit being evaluated can change with almost no overhead.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115146342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hsin-Jung Yang, Kermin Fleming, F. Winterstein, Annie I. Chen, Michael Adler, J. Emer
Memory systems play a key role in the performance of FPGA applications. As FPGA deployments move towards design entry points that are more serial, memory latency has become a serious design consideration. For these applications, memory network optimization is essential in improving performance. In this paper, we examine the automatic, program-optimized construction of low-latency memory networks. We design a feedback-driven network compiler, which constructs an optimized memory network based on the target program's memory access behavior measured via a newly designed network profiler. In our test applications, the compiler-optimized networks provide a 45% performance gain on average over baseline memory networks by minimizing the impact of network latency on program performance.
{"title":"Automatic Construction of Program-Optimized FPGA Memory Networks","authors":"Hsin-Jung Yang, Kermin Fleming, F. Winterstein, Annie I. Chen, Michael Adler, J. Emer","doi":"10.1145/3020078.3021748","DOIUrl":"https://doi.org/10.1145/3020078.3021748","url":null,"abstract":"Memory systems play a key role in the performance of FPGA applications. As FPGA deployments move towards design entry points that are more serial, memory latency has become a serious design consideration. For these applications, memory network optimization is essential in improving performance. In this paper, we examine the automatic, program-optimized construction of low-latency memory networks. We design a feedback-driven network compiler, which constructs an optimized memory network based on the target program's memory access behavior measured via a newly designed network profiler. In our test applications, the compiler-optimized networks provide a 45% performance gain on average over baseline memory networks by minimizing the impact of network latency on program performance.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123312090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhihong Huang, Xing Wei, Grace Zgheib, Wei Li, Y. Lin, Zhenghong Jiang, Kaihui Tu, P. Ienne, Haigang Yang
The And-Inverter Cone (AIC) has been introduced as an alternative logic element to the look-up table in FPGAs, since it improves their performance and resource utilization. However, further analysis of the AIC design showed that it suffers from a delay discrepancy problem. Furthermore, the existing AIC cluster design is not properly optimized and contains unnecessary logic that impedes its performance. Thus, in this work we propose a more efficient logic element called NAND-NOR, together with delay-balanced dual-phase multiplexers for the input crossbar. Our simulations show that the NAND-NOR brings a substantial reduction in delay discrepancy, with a 14% to 46% delay improvement compared to AICs. Along with the other modifications, it reduces the total cluster area by about 27% compared to the reference AIC cluster. Testing the new architecture on a large set of benchmarks shows an improvement in the delay-area product of about 44% and 21% for the MCNC and VTR benchmarks, respectively, compared to a LUT-based cluster. This improvement reaches 31% and 19%, respectively, when compared to the AIC-based architecture.
{"title":"NAND-NOR: A Compact, Fast, and Delay Balanced FPGA Logic Element","authors":"Zhihong Huang, Xing Wei, Grace Zgheib, Wei Li, Y. Lin, Zhenghong Jiang, Kaihui Tu, P. Ienne, Haigang Yang","doi":"10.1145/3020078.3021750","DOIUrl":"https://doi.org/10.1145/3020078.3021750","url":null,"abstract":"The And-Inverter Cone has been introduced as an alternative logic element to the look-up table in FPGAs, since it improves their performance and resource utilization. However, further analysis of the AIC design showed that it suffers from the delay discrepancy problem. Furthermore, the existing AIC cluster design is not properly optimized and has some unnecessary logic that impedes its performance. Thus, we propose in this work a more efficient logic element called NAND-NOR and a delay-balanced dual-phased multiplexers for the input crossbar. Our simulations show that the NAND-NOR brings substantial reduction in delay discrepancy with a 14% to 46% delay improvement when compared to AICs. And, along with the other modifications, it reduces the total cluster area by about 27%, when compared to the reference AIC cluster. Testing the new architecture on a large set of benchmarks shows an improvement of the delay-area product by about 44% and 21% for the MCNC and VTR benchmarks, respectively, when compared to LUT-based cluster. This improvement reaches 31% and 19%, respectively, when compared to the AIC-based architecture.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122637463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work measures the performance and power consumption gap between the current generation of low power FPGAs and low power microprocessors (microcontrollers) through an implementation of the Canny edge detection algorithm. In particular, the algorithm is implemented on Altera MAX 10 FPGAs, and its performance and power consumption are then compared to the same algorithm implemented on STMicroelectronics' ARM M-series microcontrollers. We found an extremely large performance advantage, four to five orders of magnitude, of the FPGAs over the microcontrollers, much greater than any previously reported values in FPGA vs. processor studies. Furthermore, this speedup comes at a cost of only 1.2x to 15x higher power consumption, which gives FPGAs a significant advantage in energy efficiency. We also observe, however, that the current generation of low power FPGAs has significantly higher static power consumption than the microcontrollers. In particular, the low power FPGAs consume more static power than the total power consumption of the lowest-power microcontrollers, rendering the FPGAs inoperable under the power budgets of these processors. This high static power consumption exists despite the fact that the FPGAs are implemented in a low-leakage 55nm process with dual supply voltages, while the microcontrollers are implemented in a conventional, single-supply-voltage 90nm process. Consequently, our results indicate that it is particularly important for future research to address the static power consumption of low power FPGAs while maintaining logic capacity, so that the performance and energy efficiency advantages of FPGAs can be fully exploited in extremely low power application domains driven by batteries with very small form factors and by emerging small-scale energy harvesting technologies.
{"title":"Measuring the Power-Constrained Performance and Energy Gap between FPGAs and Processors (Abstract Only)","authors":"A. Ye, K. Ganesan","doi":"10.1145/3020078.3021756","DOIUrl":"https://doi.org/10.1145/3020078.3021756","url":null,"abstract":"This work measures the performance and power consumption gap between the current generation of low power FPGAs and low power microprocessors (microcontrollers) through an implementation of the Canny edge detection algorithm. In particular, the algorithm is implemented on Altera MAX 10 FPGAs and its performance and power consumption are then compared to the same algorithm implemented on the STMicroelectronics' implementation of the ARM M-series microcontrollers. We found an extremely high, four- to five-orders of magnitude, performance advantage of the FPGAs over the microcontrollers, which is much greater than any previously reported values in FPGAs vs. processors studies. Furthermore, this speedup only comes at a cost of 1.2x to 15x higher power consumption, which gives FPGAs a significant advantage in energy efficiency. We also observe, however, the current generation of low power FPGAs have significantly higher static power consumption than the microcontrollers. In particular, the low power FPGAs consume more static power than the total power consumption of the lowest power consuming microcontrollers, rendering the FPGAs inoperable under the power budgets of these processors. Furthermore, this high static power consumption exists despite the fact that the FPGAs are implemented on a low leakage 55nm process with dual supply voltages while the microcontrollers are implemented on a conventional, single supply voltage, 90nm process. Consequently, our results indicate that it is particular important for future research to address the static power consumption of low power FPGAs while maintaining logic capacity so the performance and energy efficiency advantages of the FPGAs can be fully utilized in the extremely low power application domain that are driven by batteries with very small form factors and emerging small scale energy harvesting technologies.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"276 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124212785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this presentation we show that side-channels arising from the micro-architecture of SoC FPGAs can be a security risk. We present an FPGA trojan, written in OpenCL, that performs cache-timing attacks through the accelerator coherency port (ACP) of a SoC FPGA. Its primary goal is to derive physical addresses used by the Linux kernel on the ARM Hard Processor System. With this information, the trojan can surgically change memory locations to gain privileges, as in a rootkit. We present the customisation of the Altera OpenCL platform and the OpenCL code that implements the trojan. We show that it is possible to accurately predict the physical address and page table entries corresponding to an arbitrary location in the heap after a sufficient number (~300) of iterations, using differential ranking. The attack can be refined with knowledge of the Linux kernel's page table structure to accurately determine the target physical address and its corresponding page table entry. Malicious code can then be injected from the FPGA by redirecting page table entries. Since Linux kernel version 4.0-rc5, physical addresses have been hidden from normal users to prevent Rowhammer attacks; with information from the ACP side-channel, this measure can be bypassed.
{"title":"Cache Timing Attacks from The SoCFPGA Coherency Port (Abstract Only)","authors":"S. Chaudhuri","doi":"10.1145/3020078.3021802","DOIUrl":"https://doi.org/10.1145/3020078.3021802","url":null,"abstract":"In this presentation we show that side-channels arising from micro-architecture of SoCFPGAs could be a security risk. We present a FPGA trojan based on OpenCL which performs cache-timing attacks through the accelerator coherency port (ACP) of a SoCFPGA. Its primary goal is to derive physical addresses used by the Linux kernel on ARM Hard Processor System. With this information the trojan can then surgically change memory locations to gain privileges as in a rootkit. We present the customisation to the Altera OpenCL platform, and the OpenCL code to implement the trojan. We show that it is possible to accurately predict physical addresses and the page table entries corresponding to an arbitrary location in the heap after sufficient (~300) iterations, and by using a differential ranking. The attack can be refined by the known page table structure of the Linux kernel, to accurately determine the target physical address, and its corresponding page table entry. Malicious code can then be injected from FPGA, by redirecting page table entries. Since Linux kernel version 4.0-rc5 physical addresses are obfuscated from the normal user to prevent Rowhammer attacks. With information from ACP side-channel the above measure can be bypassed.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121228973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Rozhko, Geoffrey Elliott, D. Ly-Ma, P. Chow, H. Jacobsen
Packet processing systems increasingly need larger rulesets to satisfy the needs of deep-network intrusion prevention and cluster computing. FPGA-based implementations of packet processing systems have been proposed but their use of on-chip memory limits the number of rules these existing systems can maintain. Off-chip memories have traditionally been too slow to enable meaningful processing rates, but in this work we present a packet processing system that utilizes the much faster Hybrid Memory Cube (HMC) technology, enabling larger rulesets at usable line-rates. The proposed architecture streams rules from the HMC memory to a packet matching engine, using prefetching to hide the HMC access latency. The packet matching engine is replicated to process multiple packets in parallel. The final system, implemented on a Xilinx Kintex Ultrascale 060, processes 160 packets in parallel, achieving a 10 Gbps line-rate with approximately 1500 rules and a 16 Mbps line-rate with 1M rules. To the best of our knowledge, this is the first hardware solution capable of maintaining rulesets of this size. We present this work as an exploration of the application of HMCs to packet processing and as a first step in achieving a processing capability of a million rules at usable line-rates.
{"title":"Packet Matching on FPGAs Using HMC Memory: Towards One Million Rules","authors":"Daniel Rozhko, Geoffrey Elliott, D. Ly-Ma, P. Chow, H. Jacobsen","doi":"10.1145/3020078.3021752","DOIUrl":"https://doi.org/10.1145/3020078.3021752","url":null,"abstract":"Packet processing systems increasingly need larger rulesets to satisfy the needs of deep-network intrusion prevention and cluster computing. FPGA-based implementations of packet processing systems have been proposed but their use of on-chip memory limits the number of rules these existing systems can maintain. Off-chip memories have traditionally been too slow to enable meaningful processing rates, but in this work we present a packet processing system that utilizes the much faster Hybrid Memory Cube (HMC) technology, enabling larger rulesets at usable line-rates. The proposed architecture streams rules from the HMC memory to a packet matching engine, using prefetching to hide the HMC access latency. The packet matching engine is replicated to process multiple packets in parallel. The final system, implemented on a Xilinx Kintex Ultrascale 060, processes 160 packets in parallel, achieving a 10~Gbps line-rate with approximately 1500 rules and a 16~Mbps line-rate with 1M rules. To the best of our knowledge, this is the first hardware solution capable of maintaining rulesets of this size. We present this work as an exploration of the application of HMCs to packet processing and as a first step in achieving a processing capability of a million rules at usable line-rates.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125713999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations to the input logic network before carrying out technology mapping. While these "known recipes" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to say, such miscorrelations eventually result in suboptimal quality of results. In this paper we propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework to minimize circuit area for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation of the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and uses the mapping result to determine the likelihood of accepting the proposed transformation. To mitigate the runtime overhead, we further introduce parallelization techniques that decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that our approach achieves promising area improvements over a set of commonly used benchmarks. Notably, PIMap reduces LUT usage by up to 14%, and by 7% on average, over the best-known records for the EPFL arithmetic benchmark suite.
{"title":"A Parallelized Iterative Improvement Approach to Area Optimization for LUT-Based Technology Mapping","authors":"Gai Liu, Zhiru Zhang","doi":"10.1145/3020078.3021735","DOIUrl":"https://doi.org/10.1145/3020078.3021735","url":null,"abstract":"Modern FPGA synthesis tools typically apply a predetermined sequence of logic optimizations on the input logic network before carrying out technology mapping. While the \"known recipes\" of logic transformations often lead to improved mapping results, there remains a nontrivial gap between the quality metrics driving the pre-mapping logic optimizations and those targeted by the actual technology mapping. Needless to mention, such miscorrelations would eventually result in suboptimal quality of results. In this paper we propose PIMap, which couples logic transformations and technology mapping under an iterative improvement framework to minimize the circuit area for LUT-based FPGAs. In each iteration, PIMap randomly proposes a transformation on the given logic network from an ensemble of candidate optimizations; it then invokes technology mapping and makes use of the mapping result to determine the likelihood of accepting the proposed transformation. To mitigate the runtime overhead, we further introduce parallelization techniques to decompose a large design into multiple smaller sub-netlists that can be optimized simultaneously. Experimental results show that our approach achieves promising area improvement over a set of commonly used benchmarks. Notably, PIMap reduces the LUT usage by up to 14% and 7% on average over the best-known records for the EPFL arithmetic benchmark suite.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129186076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hans Giesen, Raphael Rubin, Benjamin Gojman, A. DeHon
How should we perform component-specific adaptation for FPGAs? Prior work has demonstrated that the negative effects of variation can be largely mitigated using complete knowledge of device characteristics and a full per-FPGA CAD flow. However, the cost of per-FPGA characterization and mapping can be prohibitive. We explore lightweight options for per-FPGA mapping that avoid the need for a priori device characterization and perform less expensive per-FPGA customization. We characterize the tradeoff between quality of results (energy, delay) and per-device mapping cost for seven design points, ranging from mapping with complete device knowledge to no per-device mapping. We show that it is possible to obtain 48-77% of the delay benefit of component-specific mapping, or 57% of the energy benefit, with a mapping that takes less than 20 seconds per FPGA. An incremental solution can start execution after a 21 ms bitstream load and converge to 77% of the delay benefit after 18 seconds of runtime.
{"title":"Quality-Time Tradeoffs in Component-Specific Mapping: How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays","authors":"Hans Giesen, Raphael Rubin, Benjamin Gojman, A. DeHon","doi":"10.1145/3020078.3026124","DOIUrl":"https://doi.org/10.1145/3020078.3026124","url":null,"abstract":"How should we perform component-specific adaptation for FPGAs? Prior work has demonstrated that the negative effects of variation can be largely mitigated using complete knowledge of device characteristics and full per-FPGA CAD flow. However, the cost of per-FPGA characterization and mapping could be prohibitively expensive. We explore light-weight options for per-FPGA mapping that avoid the need for a priori device characterization and perform less expensive per FPGA customization work. We characterize the tradeoff between Quality-of-Results (energy, delay) and per-device mapping costs for 7 design points ranging from complete mapping based on knowledge to no per-device mapping. We show that it is possible to get 48-77% of the component-specific mapping delay benefit or 57% of the energy benefit with a mapping that takes less than 20 seconds per FPGA. An incremental solution can start execution after a 21 ms bitstream load and converge to 77% delay benefit after 18 seconds of runtime.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132758465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
There have been ample successful examples of applying Xilinx Vivado's "function-to-module" high-level synthesis (HLS) where the subject is algorithmic in nature. In this work, we carried out a design study to assess the effectiveness of applying Vivado-HLS to structural design. We employed Vivado-HLS to synthesize C functions corresponding to standalone network-on-chip (NoC) routers as well as complete multi-endpoint NoCs. Interestingly, we find that describing a complete NoC composed of router submodules faces fundamental difficulties not present when describing the routers as standalone modules. Ultimately, we succeeded in using Vivado-HLS to produce router and NoC modules that are exact cycle- and bit-accurate replacements for our reference RTL-based router and NoC modules. Furthermore, the routers and NoCs resulting from HLS and RTL are comparable in resource utilization and critical path delay. Our experience subjectively suggests that HLS can simplify the design effort, even though many of the structural details had to be provided in the HLS description through a combination of coding discipline and explicit pragmas. The C++ source code and a more extensive description of this work can be found at http://www.ece.cmu.edu/calcm/connect_hls.
{"title":"Using Vivado-HLS for Structural Design: a NoC Case Study (Abstract Only)","authors":"Zhipeng Zhao, J. Hoe","doi":"10.1145/3020078.3021772","DOIUrl":"https://doi.org/10.1145/3020078.3021772","url":null,"abstract":"There have been ample successful examples of applying Xilinx Vivado's \"function-to-module\" high-level synthesis (HLS) where the subject is algorithmic in nature. In this work, we carried out a design study to assess the effectiveness of applying Vivado-HLS in structural design. We employed Vivado-HLS to synthesize C functions corresponding to standalone network-on-chip (NoC) routers as well as complete multi-endpoint NoCs. Interestingly, we find that describing a complete NoC comprising router submodules faces fundamental difficulties not present in describing the routers as standalone modules. Ultimately, we succeeded in using Vivado-HLS to produce router and NoC modules that are exact cycle- and bit-accurate replacements of our reference RTL-based router and NoC modules. Furthermore, the routers and NoCs resulting from HLS and RTL are comparable in resource utilization and critical path delay. Our experience subjectively suggests that HLS is able to simplify the design effort even though much of the structural details had to be provided in the HLS description through a combination of coding discipline and explicit pragmas. The C++ source code and a more extensive description of this work can be found at http://www.ece.cmu.edu/calcm/connect_hls.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133555116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, Zhiru Zhang
Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.
{"title":"A Parallel Bandit-Based Approach for Autotuning FPGA Compilation","authors":"Chang Xu, Gai Liu, Ritchie Zhao, Stephen Yang, Guojie Luo, Zhiru Zhang","doi":"10.1145/3020078.3021747","DOIUrl":"https://doi.org/10.1145/3020078.3021747","url":null,"abstract":"Mainstream FPGA CAD tools provide an extensive collection of optimization options that have a significant impact on the quality of the final design. These options together create an enormous and complex design space that cannot effectively be explored by human effort alone. Instead, we propose to search this parameter space using autotuning, which is a popular approach in the compiler optimization domain. Specifically, we study the effectiveness of applying the multi-armed bandit (MAB) technique to automatically tune the options for a complete FPGA compilation flow from RTL to bitstream, including RTL/logic synthesis, technology mapping, placement, and routing. To mitigate the high runtime cost incurred by the complex FPGA implementation process, we devise an efficient parallelization scheme that enables multiple MAB-based autotuners to explore the design space simultaneously. In particular, we propose a dynamic solution space partitioning and resource allocation technique that intelligently allocates computing resources to promising search regions based on the runtime information of search quality from previous iterations. Experiments on academic and commercial FPGA CAD tools demonstrate promising improvements in quality and convergence rate across a variety of real-life designs.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131040699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}