
2014 International Conference on Field-Programmable Technology (FPT): Latest Publications

A flexible interface architecture for reconfigurable coprocessors in embedded multicore systems using PCIe Single-root I/O virtualization
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082780
O. Sander, S. Bähr, Enno Lübbers, T. Sandmann, Viet Vu Duy, J. Becker
Especially in complex system-of-systems scenarios, where multiple high-performance or real-time processing functions need to co-exist and interact, reconfigurable devices together with virtualization techniques show considerable promise to increase efficiency, ease integration and maintain functional and non-functional properties of the individual functions. In this paper, we propose a flexible interface architecture with low overhead for coupling reconfigurable coprocessors to high-performance general-purpose processors, allowing customized yet efficient construction of heterogeneous processing systems. Our implementation is based on PCI Express (PCIe) and optimized for virtualized systems, taking advantage of the SR-IOV capabilities in modern PCIe implementations. We describe the interface architecture and its fundamental technologies, detail the services provided to individual coprocessors and accelerator modules, and quantify key corner performance indicators relevant for virtualized applications.
Pages: 223-226
Citations: 9
HW acceleration of multiple applications on a single FPGA
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082797
Yidi Liu, Benjamin Carrión Schäfer
This work presents a fast and efficient method to map multiple computationally intensive kernels onto the same FPGA given the FPGA area and communication bandwidth constraints. FPGAs have grown to a size where multiple applications can now be mapped onto a single device. It is therefore important to develop methods that can efficiently decide which kernels of all of the applications under consideration should be mapped onto the FPGA in order to maximize the total system acceleration. Our method shows very good results compared both to a standard genetic algorithm, which is often used for multi-objective optimization problems, and to the optimal solution obtained using an exhaustive search method. Experimental results show that our method is very scalable and extremely fast.
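Deciding which kernels to realize in hardware under an FPGA area budget is, at its core, a 0/1 knapsack problem; the paper's heuristic additionally handles the communication bandwidth constraint and scales far better than exhaustive search. A minimal dynamic-programming sketch of the area-only selection (all kernel names and numbers are illustrative, not from the paper):

```python
def select_kernels(kernels, area_budget):
    """0/1-knapsack DP: choose the subset of kernels that maximizes
    total speedup without exceeding the FPGA area budget.
    kernels: list of (name, area, speedup) tuples."""
    # best[a] = (best total speedup using at most area a, chosen kernel names)
    best = [(0.0, [])] * (area_budget + 1)
    for name, area, speedup in kernels:
        for a in range(area_budget, area - 1, -1):
            cand = (best[a - area][0] + speedup, best[a - area][1] + [name])
            if cand[0] > best[a][0]:
                best[a] = cand
    return best[area_budget]

# Illustrative kernels: (name, area units, standalone speedup)
kernels = [("fft", 40, 5.0), ("matmul", 60, 7.5), ("sobel", 30, 3.0)]
print(select_kernels(kernels, 100))  # -> (12.5, ['fft', 'matmul'])
```

With an area budget of 100 units, the DP picks `fft` and `matmul` (total speedup 12.5) over the cheaper but slower pairs, which is exactly the baseline an exhaustive search would confirm.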
Pages: 284-285
Citations: 0
Accelerating transfer entropy computation
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082754
Shengjia Shao, Ce Guo, W. Luk, Stephen Weston
Transfer entropy is a measure of information transfer between two time series. It is an asymmetric measure based on entropy change which takes into account only the statistical dependency originating in the source series, excluding dependency on a common external factor. Transfer entropy is able to capture system dynamics that traditional measures cannot, and has been successfully applied to areas such as neuroscience, bioinformatics, data mining and finance. As time series become longer and their resolution becomes higher, computing transfer entropy grows demanding. This paper presents the first reconfigurable computing solution to accelerate transfer entropy computation. The novel aspects of our approach include a new technique based on Laplace's Rule of Succession for probability estimation; a novel architecture with optimised memory allocation, bit-width narrowing and mixed-precision optimisation; and its implementation targeting a Xilinx Virtex-6 SX475T FPGA. In our experiments, the proposed FPGA-based solution is up to 111.47 times faster than one Xeon CPU core, and 18.69 times faster than a 6-core Xeon CPU.
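Transfer entropy T(Y->X) sums p(x_{t+1}, x_t, y_t) · log2[p(x_{t+1}|x_t, y_t) / p(x_{t+1}|x_t)] over the observed history, and the paper's first novelty is estimating those probabilities with Laplace's Rule of Succession, which replaces a raw frequency c/n by (c+1)/(n+#outcomes) so no probability inside the logarithm is ever zero. A plain-software sketch of that estimator (illustrative only; the paper's contribution is a fixed-point FPGA pipeline, not this code):

```python
from collections import Counter
from math import log2

def transfer_entropy(x, y, k=2):
    """Estimate T(Y->X) for two k-symbol series, smoothing every
    probability with Laplace's Rule of Succession."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    singles = Counter(x[:-1])                       # x_t
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = (c + 1) / (n + k ** 3)                          # p(x1,x0,y0)
        p_cond_xy = (c + 1) / (pairs_xy[(x0, y0)] + k)            # p(x1|x0,y0)
        p_cond_x = (pairs_xx[(x1, x0)] + 1) / (singles[x0] + k)   # p(x1|x0)
        te += p_joint * log2(p_cond_xy / p_cond_x)
    return te
```

For a series x that simply copies y with one step of lag, the estimate approaches 1 bit, while the reverse direction stays near zero, matching the asymmetry described above.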
Pages: 60-67
Citations: 12
Logic emulation in the megaLUT era - Moore's Law beats Rent's Rule
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082742
M. Butts
Throughout their twenty-five-year history, logic emulation architectures have been governed by Rent's Rule. This empirical observation, first used to build 1960s mainframes, predicts the average number of cut nets that result when a digital module is arbitrarily partitioned into multiple parts, such as the FPGAs of a logic emulator. A fundamental advantage of emulation is that, unlike most devices, FPGAs always grow in capacity according to Moore's Law, just as the designs to be emulated have grown. Unfortunately packaging technology advances at a far slower pace, leaving emulators short on the pins demanded by Rent's Rule. Many cut nets are now sent through each package pin, which costs speed, power and area. At today's system-on-chip level of design, the number of system-level modules is growing, while their sizes are remaining constant. In the meantime, FPGAs have grown from a handful of logic lookup tables (LUTs) at the beginning to over a million LUTs today. At this scale, an entire system-level module such as an advanced 64-bit CPU can fit inside a single FPGA. Fewer module-internal nets need be cut, so Rent's Rule constraints are relaxing. Fewer and higher-level cut nets means logic emulation with megaLUT FPGAs is becoming faster, cooler, smaller, cheaper, and more reliable. FPGA's Moore's Law scaling is escaping from Rent's Rule.
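Rent's Rule, the empirical relation invoked above, predicts T = t · g^p external terminals for a partition of g blocks with t terminals each, where p is the Rent exponent (typically around 0.5 to 0.75). A tiny numeric sketch of why pin demand outpaces package pins (the parameter values t = 4 and p = 0.6 are illustrative, not figures from the keynote):

```python
def rent_terminals(t, g, p):
    """Rent's Rule: a partition of g logic blocks, each with t
    terminals, needs about T = t * g**p external terminals,
    where p is the empirical Rent exponent."""
    return t * g ** p

# Pin demand keeps climbing as LUT capacity grows:
for luts in (10_000, 100_000, 1_000_000):
    print(luts, "LUTs ->", round(rent_terminals(4, luts, 0.6)), "pins")
```

Even at this modest exponent, a hundredfold capacity increase multiplies terminal demand by roughly 16x, far beyond what package pin counts have gained over the same period.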
Pages: 1
Citations: 3
A complementary architecture for high-speed true random number generator
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082786
Xian-wei Yang, R. Cheung
In this paper, we introduce a novel FPGA-based design for a true random number generator (TRNG). It harvests the timing differences caused by the non-uniformity of integrated circuits (ICs) and uses them to generate randomness. Compared with previous related work, this design uses a complementary scheme that doubles the output data rate. The proposed complementary design improves entropy and achieves higher throughput. The prototype has been implemented and verified on a Xilinx Virtex-6 ML605 evaluation board. The generated random number stream passes the statistical NIST and DIEHARD test suites, showing reliable performance, while stably sustaining a maximum data rate of 50 Mbps.
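As a point of reference, the first and simplest check in the NIST suite mentioned above is the frequency (monobit) test. A software sketch of that single test (illustrative, not the authors' verification flow):

```python
from math import erfc, sqrt

def monobit_test(bits):
    """NIST SP 800-22 frequency (monobit) test: map bits to +/-1,
    sum them, and return the p-value erfc(|S| / sqrt(2n)).
    A stream passes when the p-value is >= 0.01."""
    n = len(bits)
    s = sum(1 if b else -1 for b in bits)
    return erfc(abs(s) / sqrt(2 * n))
```

A perfectly balanced stream yields a p-value of 1.0, while a heavily biased one drives the p-value toward zero and fails the 0.01 threshold.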
Pages: 248-251
Citations: 8
Design re-use for compile time reduction in FPGA high-level synthesis flows
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082746
Marcel Gort, J. Anderson
High-level synthesis (HLS) raises the level of abstraction for hardware design through the use of software methodologies. An impediment to productivity in HLS flows, however, is the run-time of the back-end toolflow - synthesis, packing, placement and routing - which can take hours or days for the largest designs. We propose a new back-end flow for HLS that makes use of pre-synthesized and placed "macros" for portions of the design, thereby reducing the amount of work to be done by the back-end tools, lowering run-time. A key aspect of our work is an analytical placement algorithm capable of handling large macros whose internal blocks have fixed relative placements, in conjunction with placing the surrounding individual logic blocks. In an experimental study, we consider the impact on run-time and quality-of-results of using macros: 1) in synthesis alone, and 2) in synthesis, packing and placement. Results show that the proposed approach reduces run-time by ~3x, on average, with a negative performance impact of ~5%.
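The cost function such placers minimize is half-perimeter wirelength (HPWL), with a macro's internal blocks pinned at fixed offsets from a single movable origin. A toy sketch of that constraint (hypothetical names; the paper's analytical placer is far more elaborate):

```python
def hpwl(nets, positions):
    """Half-perimeter wirelength: for every net, add the half
    perimeter of the bounding box of its pins' (x, y) positions."""
    total = 0
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def macro_positions(origin, offsets):
    """Blocks inside a pre-placed macro keep fixed relative offsets;
    only the macro origin is a free placement variable."""
    ox, oy = origin
    return {b: (ox + dx, oy + dy) for b, (dx, dy) in offsets.items()}

# One macro with two internal blocks, plus one free-standing block:
pos = macro_positions((10, 10), {"m.a": (0, 0), "m.b": (3, 1)})
pos["free"] = (0, 0)
print(hpwl([["m.a", "m.b"], ["m.b", "free"]], pos))  # -> 28
```

Moving the macro origin shifts all of its internal pins together, which is exactly what lets the back-end skip re-placing the macro's contents and cut run-time.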
Pages: 4-11
Citations: 19
Evaluation of SNMP-like protocol to manage a NoC emulation platform
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082776
O. A. D. L. Junior, V. Fresse, F. Rousseau
Networks-on-Chip (NoCs) are currently the most appropriate communication structure for many-core embedded systems. An FPGA-based emulation platform can drastically reduce the time needed to evaluate a NoC, even when it is composed of tens or hundreds of distributed components. These components must be managed in a timely manner in order to execute an evaluation traffic scenario. There is a lack of standard protocols to drive FPGA-based NoC emulators; such protocols could ease the integration of emulation components developed by different designers. In this paper, we evaluate a light version of SNMP (Simple Network Management Protocol) to manage an FPGA-based NoC emulation platform. The SNMP protocol and its related components are adapted to a hardware implementation. This facilitates the configuration of the emulation nodes without FPGA re-synthesis, as well as the extraction of emulation results. Experiments highlight that this protocol is quite simple to implement and very efficient, with a light resource overhead.
Pages: 199-206
Citations: 8
Analyzing the impact of heterogeneous blocks on FPGA placement quality
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082750
Chang Xu, Wentai Zhang, Guojie Luo
In this paper we propose a quantitative approach to analyze the impact of heterogeneous blocks (H-blocks) on FPGA placement quality. The basic idea is to construct synthetic heterogeneous placement benchmarks with known optimal wirelength to facilitate the quantitative analysis. To the best of our knowledge, this is the first work that enables the construction of wirelength-optimal heterogeneous placement examples. Besides analyzing the quality of existing placers, we further decompose the impact of H-blocks into an architectural aspect and a netlist aspect. Our analysis shows that a heterogeneous design hides its wirelength degradation behind a more compact netlist than its homogeneous version; however, the heterogeneity results in an optimality gap of 52% in wirelength, where 25% comes from architectural heterogeneity and 27% from netlist heterogeneity. Therefore, new heterogeneous placement algorithms are needed to bridge the optimality gap and improve design quality.
Pages: 36-43
Citations: 2
Improve memory access for achieving both performance and energy efficiencies on heterogeneous systems
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082759
Hongyuan Ding, Miaoqing Huang
Hardware accelerators are capable of achieving significant performance improvements for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study, in which three bottleneck stages are accelerated in hardware logic. Based on the different memory access patterns of the SIFT algorithm, two approaches are designed to accelerate different functions of SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by a fully customized hardware accelerator, and three interfacing methods are analyzed on top of it. In the second approach, a distributed multi-processor hardware system, together with its programming model, is built to handle non-consecutive memory accesses. Furthermore, the last-level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experimental results on the Zynq-7045 device show that the hybrid design combining the two approaches achieves roughly 10x or greater improvement in both performance and energy reduction over the pure software implementation, for the convolution stage and the SIFT algorithm respectively.
Pages: 91-98
Citations: 5
FPGA-based high throughput XTS-AES encryption/decryption for storage area network
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082791
Yi (Estelle) Wang, Akash Kumar, Yajun Ha
The key issue in improving the performance of secure large-scale Storage Area Network (SAN) applications lies in the speed of the encryption/decryption module, and software-based encryption/decryption cannot meet throughput requirements. To solve this problem, we propose an FPGA-based XTS-AES encryption/decryption engine to suit the needs of secure SAN applications with high throughput requirements. Besides throughput, area optimization is also considered in the proposed design. First, we reuse the same AES encryption to produce the tweak value and unify the operations of AES encryption/decryption within XTS-AES. Second, we transfer the computations of AES encryption/decryption from GF(2^8) to the composite field GF((2^4)^2), which enables us to move the map and inverse-map functions outside the AES round. Third, we propose supporting SubBytes and inverse SubBytes with the same hardware component. Finally, pipelined registers have been inserted into the proposed unrolled architecture for XTS-AES encryption/decryption. The experiments show that the proposed design achieves 36.2 Gbit/s throughput using 6784 slices on a Xilinx XC6VLX240T FPGA.
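Between consecutive data blocks of a sector, XTS updates the AES-encrypted tweak value by multiplying it by the primitive element alpha of GF(2^128); IEEE 1619 defines this as a one-bit left shift in little-endian byte order with a 0x87 fold on carry-out. A software sketch of that update (the paper realizes it in FPGA logic):

```python
def xts_mul_alpha(tweak):
    """Multiply a 16-byte XTS tweak by alpha in GF(2^128), per
    IEEE 1619: shift left by one bit in little-endian byte order,
    folding the carry-out back in as 0x87 (reduction modulo
    x^128 + x^7 + x^2 + x + 1)."""
    out = bytearray(16)
    carry = 0
    for i in range(16):
        out[i] = ((tweak[i] << 1) & 0xFF) | carry
        carry = tweak[i] >> 7
    if carry:
        out[0] ^= 0x87
    return bytes(out)
```

Because each per-block tweak is just the previous one shifted and conditionally folded, the update is a single cycle of shift-and-XOR logic in hardware, which is what makes reusing one AES core for the tweak attractive.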
2014 International Conference on Field-Programmable Technology (FPT), pp. 268-271
Citations: 7
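The XTS-AES abstract above mentions reusing the same AES core to produce the tweak value. For readers unfamiliar with the mode: after the initial tweak is produced by one AES encryption of the sector number, the tweak for each subsequent 16-byte block is obtained by multiplying by α (the polynomial x) in GF(2^128), which in hardware reduces to a shift and conditional XOR. The sketch below is an illustrative software model of that standard per-block update under the IEEE P1619 convention (little-endian bytes, reduction polynomial x^128 + x^7 + x^2 + x + 1); it is not the paper's implementation, and the placeholder tweak value is invented for demonstration.

```python
def xts_mult_alpha(tweak: bytes) -> bytes:
    """Multiply a 128-bit XTS tweak by alpha (the polynomial x) in GF(2^128).

    IEEE P1619 convention: little-endian byte order, reduction polynomial
    x^128 + x^7 + x^2 + x + 1, i.e. feedback constant 0x87 on carry-out.
    """
    t = int.from_bytes(tweak, "little") << 1
    if t >> 128:                        # carry out of bit 127: reduce
        t = (t & ((1 << 128) - 1)) ^ 0x87
    return t.to_bytes(16, "little")

# The tweak for block j is T_j = AES_Enc(key2, sector_number) * alpha^j,
# so consecutive blocks each need only one doubling of the previous tweak.
t0 = b"\x01" + b"\x00" * 15             # placeholder for the encrypted sector number
t1 = xts_mult_alpha(t0)                 # tweak for the next 16-byte block
```

This is why the paper's first optimization pays off: only the initial tweak needs a full AES encryption (so the data-path AES core can be reused for it), while the per-block doubling chain costs just shift/XOR logic.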
2014 International Conference on Field-Programmable Technology (FPT)