Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082796
Brahim Al Farisi, Karel Heyse, D. Stroobandt
A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which only one needs to be realised at any given time. Using dynamic partial reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area large enough to contain the biggest mode. This can save considerable chip area. Conventional dynamic partial reconfiguration techniques generate a configuration for every mode separately. As a result, switching between modes rewrites the complete reconfigurable region, which often leads to long reconfiguration times. In this paper we give an overview of our research on reducing this reconfiguration overhead for multi-mode circuits, in which we explored several joint optimization strategies at different stages of the tool flow.
{"title":"Reducing the overhead of dynamic partial reconfiguration for multi-mode circuits","authors":"Brahim Al Farisi, Karel Heyse, D. Stroobandt","doi":"10.1109/FPT.2014.7082796","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082796","url":null,"abstract":"A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which at any given time only one needs to be realised. Using dynamic partial reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area that can contain the biggest mode. This can save considerable chip area. Conventional dynamic partial reconfiguration techniques generate a configuration for every mode separately. As a result, to switch between modes the complete reconfigurable region is rewritten, which often leads to long reconfiguration times. In this paper we give an overview of research we conducted to reduce this overhead of dynamic partial reconfiguration for multi-mode circuits. In this research we explored several joint optimization strategies at different stages of the tool flow.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"32 1","pages":"282-283"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78802435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082785
Server Kasap, Soydan Redif
In this paper, we introduce a novel reconfigurable hardware architecture for computing the polynomial matrix multiplication (PMM) of polynomial matrices/vectors. The proposed algorithm exploits an extension of the fast convolution technique to multiple-input, multiple-output (MIMO) systems. The proposed architecture is the first one devoted to the hardware implementation of PMM. Hardware implementation of the algorithm is achieved via a highly pipelined, partly systolic FPGA architecture. We verify the algorithmic accuracy of the architecture, which is scalable in terms of the order of the input matrices, through FPGA-in-the-loop hardware co-simulations. Results are presented to demonstrate the accuracy and capability of the architecture.
{"title":"Novel reconfigurable hardware implementation of polynomial matrix/vector multiplications","authors":"Server Kasap, Soydan Redif","doi":"10.1109/FPT.2014.7082785","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082785","url":null,"abstract":"In this paper, we introduce a novel reconfigurable hardware architecture for computing the polynomial matrix multiplication (PMM) of polynomial matrices/vectors. The proposed algorithm exploits an extension of the fast convolution technique to multiple-input, multiple-output (MIMO) systems. The proposed architecture is the first one devoted to the hardware implementation of PMM. Hardware implementation of the algorithm is achieved via a highly pipelined, partly systolic FPGA architecture. We verify the algorithmic accuracy of the architecture, which is scalable in terms of the order of the input matrices, through FPGA-in-the-loop hardware co-simulations. Results are presented to demonstrate the accuracy and capability of the architecture.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"243-247"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83748456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082792
Susumu Mashimo, M. Amagasaki, M. Iida, M. Kuga, T. Sueyoshi
Many Android systems require high performance, because embedded systems running this operating system are used in many fields and rely on increasingly complicated processing. To accommodate this, we present a software/hardware (SW/HW) coprocessing platform implemented on a programmable system-on-chip (Xilinx Zynq). This platform provides a unified architecture, an extended OS kernel, an application framework, and an application distribution model to simplify the development and use of Android SW/HW coprocessing applications.
{"title":"Zyndroid: An Android platform for software/hardware coprocessing","authors":"Susumu Mashimo, M. Amagasaki, M. Iida, M. Kuga, T. Sueyoshi","doi":"10.1109/FPT.2014.7082792","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082792","url":null,"abstract":"High performance is required of many Android systems because embedded systems written for this operating system are used in several fields and rely on increasingly complicated processing. To accommodate this, we present a software/hardware (SW/HW) coprocessing platform implemented on a programmable system-on-a-chip (Xilinx Inc.: Zynq). This platform provides a unified architecture, extended OS kernel, application framework, and application distribution model to simplify the development and use of Android SW/HW coprocessing applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"393 1","pages":"272-275"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73016197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082818
Yu Fujita, K. Masuyama, H. Amano
Cool Mega Array with Silicon On Thin BOX (CMA-SOTB) is an extremely low-power coarse-grained reconfigurable accelerator. It was implemented using SOTB technology developed by the Japanese national project LEAP (Low-power Electronics Association & Project). Making the best use of this device technology and low-energy architectural techniques, CMA-SOTB runs at a clock of more than 25 MHz with a supply voltage of less than 0.3 V. Various kinds of optimization can be performed by controlling the body bias voltages of the PE array and the micro-controller independently. The demonstration using CMA-SOTB first shows that a simple image processing application can run from a 0.25V-0.4V solar battery. Then, leakage power control by changing the body bias is demonstrated. In stand-by mode, less than 20 μW is consumed by applying a strong reverse bias.
{"title":"Image processing by A 0.3V 2MW coarse-grained reconfigurable accelerator CMA-SOTB with a solar battery","authors":"Yu Fujita, K. Masuyama, H. Amano","doi":"10.1109/FPT.2014.7082818","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082818","url":null,"abstract":"Cool mega array with silicon on thin box (CMA-SOTB) is an extremely low power coarse grained reconfigurable accelerator. It was implemented by using the SOTB technology developed by a Japanese national project, low-power electronics association & project (LEAP). Making the best use of such a device and low energy architectural techniques, CMA-SOTB works more than 25MHz clock with less than 0.3V supply voltage. Various kind of optimization can be done by controlling the body bias voltage for PE array and micro-controller independently. The demonstration using CMA-SOTB first shows that a simple image processing application can work with a 0.25V-0.4V solar battery. Then the leakage power control by changing the body bias is demonstrated. In the stand-by mode, less than 20μW power is consumed by using strong reverse bias.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"354-357"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74458699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082821
Hossein Borhanifar, Seyed Peyman Zolnouri
In this paper, a solution for the Blokus Duo game is presented using the minimax algorithm. Alpha-beta pruning is then applied to the algorithm so that its running speed increases significantly, reducing playing time. Moreover, the Pentobi® software is used as a benchmark opponent, and the final results against it are reported. All code is written directly for hardware in the VHDL language. After synthesis with Quartus II®, the design is implemented on the DE1-SoC board, which uses a Cyclone V FPGA.
{"title":"Optimize MinMax algorithm to solve Blokus Duo game by HDL","authors":"Hossein Borhanifar, Seyed Peyman Zolnouri","doi":"10.1109/FPT.2014.7082821","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082821","url":null,"abstract":"In this paper, a solution for Blokus Duo game is presented using minmax algorithm. Then, Alpha-beta pruning method is implemented on the algorithm to reduce playing time in the manner that its running speed increases significantly. Moreover, Pentobi® software as a criterion is used as a competitor and the final results for that are reported. All codes are directly written on a hardware basis using VHDL language. After being synthesized by Quartus II®, the result is implemented on DE1-SOC board which uses cyclone V FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"322 1","pages":"362-365"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76462085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082817
Jiahua Chen, Tao Wang, Haoyang Wu, Jian Gong, Xiaoguang Li, Yang Hu, Gaohan Zhang, Zhiwei Li, Junrui Yang, Songwu Lu
The ongoing mobile Internet revolution calls for quick adoption of new wireless communication and networking technologies. To enable such fast innovation, a software-defined platform is needed to validate and refine new algorithms, protocols, and architectures in communications and networking. Unfortunately, no current systems can meet both requirements of high programmability and high performance. In this work, we report our recent effort on building such a reconfigurable platform. We show that our proposed platform, GRT, can support both high performance and high programmability in a unified framework. Moreover, GRT is seamlessly integrated into the standard TCP/IP network protocol stack under Linux, and can act as a WiFi-capable network interface card. Furthermore, it ensures backward compatibility with the popular GNU Radio platform, a user-friendly yet low-performance system. In the demo, we will demonstrate the full functionalities of 802.11a/g WiFi on GRT, including (1) wireless file transfer between two GRT systems at a speed of tens of Mbps; (2) execution of default Linux TCP/IP applications without changes (e.g. SSH); (3) access point (AP) operation mode, where commodity WiFi devices access the Internet via the GRT-converted AP over the WiFi channel.
{"title":"A high-performance and high-programmability reconfigurable wireless development platform","authors":"Jiahua Chen, Tao Wang, Haoyang Wu, Jian Gong, Xiaoguang Li, Yang Hu, Gaohan Zhang, Zhiwei Li, Junrui Yang, Songwu Lu","doi":"10.1109/FPT.2014.7082817","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082817","url":null,"abstract":"The ongoing mobile Internet revolution calls for quick adoptions of new wireless communication and networking technologies. To enable such fast innovations, a software-defined platform is needed to validate and refine new algorithms, protocols, and architectures in communications and networking. Unfortunately, no current systems can meet both requirements of high programmability and high performance. In this work, we report our recent effort on building such a reconfigurable platform. We show that our proposed platform, GRT, can support both high-performance and high-programmability in a unified framework. Moreover, GRT is seamlessly integrated into the standard TCP/IP network protocol stack under Linux, and can act as a WiFi-capable, network interface card. Furthermore, it ensures backward compatibility with the popular GNU Radio platform, a user-friendly, yet low-performance system. In the demo, we will demonstrate the full functionalities of the 802.11a/g WiFi on GRT, including (1) wireless file transfer between two GRT systems at the speed of tens of Mbps; (2) execution of default Linux TCP/IP applications without changes (e.g. SSH); (3) access point (AP) operation mode, where commodity WiFi devices access the Internet via the GRT-converted AP over the WiFi channel.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"9 1","pages":"350-353"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78628658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082804
Niranjan S. Kulkarni, Jinghua Yang, S. Vrudhula
Threshold-logic gates have long been known to result in more compact and faster circuits than their conventional AND/OR logic equivalents [1]. However, threshold-logic-based design has not entered mainstream design technology (neither custom ASIC nor FPGA) due to the lack of efficient and reliable gate implementations and of the necessary infrastructure for automated synthesis and physical design. This paper is a step toward addressing this gap. We present the architecture of a novel programmable logic array, referred to as a Field Programmable Threshold-Logic Array (FPTLA), in which the basic cells are differential-mode threshold-logic gates (DTGs). Each DTG cell is a clock-edge-triggered circuit that computes a threshold-logic function. A DTG can be programmed to implement different threshold-logic functions by routing appropriate signals to its inputs. This reduces the number of SRAM cells inside the logic blocks by about 60% compared to conventional CLBs, without adding any significant overhead to the routing infrastructure. Since a DTG is essentially a multi-input, edge-triggered flip-flop that computes a threshold function, a network of DTGs forms a nano-pipelined circuit. The advantages of such a network are demonstrated on a set of deeply pipelined datapath circuits implemented on FPTLAs and conventional FPGAs using the well-established FPGA design frameworks VTR (Verilog To Routing) and VPR (Versatile Place and Route) [2]. The results indicate that an FPTLA can achieve up to a 2X improvement in delay for nearly the same energy and logic area as a conventional LUT-based FPGA. Although differential-mode circuits can potentially be more sensitive to process variations, FPTLAs can be made robust to such variations without sacrificing their improved energy efficiency and performance over FPGAs.
{"title":"A fast, energy efficient, field programmable threshold-logic array","authors":"Niranjan S. Kulkarni, Jinghua Yang, S. Vrudhula","doi":"10.1109/FPT.2014.7082804","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082804","url":null,"abstract":"Threshold-logic gates have long been known to result in more compact and faster circuits when compared to conventional AND/OR logic equivalents [1], However, threshold logic based design has not entered the mainstream design technology (neither custom ASIC nor FPGA) due to the lack of efficient and reliable gate implementations and the necessary infrastructure for automated synthesis and physical design. This paper is a step toward addressing this gap. We present the architecture of a novel programmable logic array, referred to as Field Programmable Threshold-Logic Array (FPTLA), in which the basic cells are differential mode threshold-logic gates (DTGs). Each individual DTG cell is a clock edge-triggered circuit that computes a threshold-logic function. A DTG can be programmed to implement different threshold logic functions by routing appropriate signals to their inputs. This reduces the number of SRAMs inside the logic blocks by about 60% compared to conventional CLBs, without adding any significant overhead in the routing infrastructure. Since a DTG is essentially a multi-input, edge-triggered flipflop that computes a threshold function, a network of DTGs forms a nano-pipelined circuit. The advantages of such a network are demonstrated on a set of deeply pipelined datapath circuits implemented on FPTLAs and conventional FPGAs using the well established FPGA design framework VTR (Verilog To Routing) and VPR (Versatile Place and Route) [2]. The results indicate that an FPTLA can achieve up to 2X improvement in delay for nearly the same energy and logic area compared to the conventional LUT based FPGA. Although differential mode circuits can potentially be more sensitive to process variations, FPTLAs can be made robust to such variations without sacrificing their improved energy efficiency and performance over FPGAs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"12 1","pages":"300-305"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77196656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082816
Ma Ning, Wang Shaojun, Pang Yeyong, Peng Yu
In recent years, implementing complicated algorithms in embedded systems, especially in heterogeneous computing systems, has gained more and more attention in many fields. The problem is that such an implementation requires a large amount of coding and debugging work, even if the algorithm has already been verified in a high-level language in a PC environment. Our demo presents a method that reduces the time needed to develop an algorithm for an embedded, heterogeneous system by using high-level synthesis (HLS). The least-squares support vector machine (LS-SVM) algorithm was realized on the Zynq platform by translating a high-level language into a hardware description language (HDL). Based on the features of the developed heterogeneous system and the theory of LS-SVM, three parts were implemented: a kernel-matrix generation module, a linear-equation solving module, and a forecasting module. The first and third parts are placed on the ARM processor and written in C. Moreover, since the second part is compute-intensive, it is realized in logic resources using a high-level language. To manage data communication and computing tasks, an SoPC system was designed on the Zynq platform, operating in a PXI chassis. Experiments demonstrate that the design method is feasible and can be used to implement other complicated algorithms. The computational precision and time consumption are given at the end.
{"title":"Implementation of LS-SVM with HLS on Zynq","authors":"Ma Ning, Wang Shaojun, Pang Yeyong, Peng Yu","doi":"10.1109/FPT.2014.7082816","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082816","url":null,"abstract":"In recent years, implementing a complicated algorithm in an embedded system, especially in a heterogeneous computing system, has gained more and more attention in many fields. The problem is that the implementation needs amounts of coding and debugging work, even if the algorithm has been verified by high-level language in PC environment. Our demo presents a method which can reduce the time of developing an algorithm in an embedded and heterogeneous system by high level synthesis method. Least Square Support Vector Machine(LS-SVM) algorithm was realized on Zynq platform by translating high-level language to Hardware Description Language(HDL). Basing on the feature of the developed heterogeneous system and the theory of LS-SVM, three parts were implemented to realize LS-SVM which includes a generating Kernel Matrix module, a solving linear equations module and a forecasting module. The first and the third parts have been placed in ARM processor by C language. Moreover, considering that the second parts was compute-intensive, it has been realized in logic resource by using high-level language. To manage data communication and computing task, an SOPC system has been designed on Zynq platform which worked in PXI chassis. Experiments demonstrate that the design method is feasible and can be used for the implementation of other complicate algorithm. The precision and time consumption in computing are given at the end.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"81 1","pages":"346-349"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76733763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082762
B. Liebig, A. Koch
With growing FPGA capacities, applications requiring more intensive use of floating-point arithmetic become feasible candidates for acceleration using reconfigurable logic. Still among the more uncommon operations, however, are fast double-precision divider units. Since our application domain (acceleration of custom-compiled convex solvers) relies heavily on these blocks, we have implemented low-latency dividers based on the Goldschmidt algorithm that are accurate to within one unit of least precision (1 ULP). On Virtex-6 devices, our units operate at 200 MHz and significantly outperform other state-of-the-art 1-ULP dividers. We evaluate our blocks both stand-alone and at the application level, when used for the high-level synthesis of the convex solver cores.
{"title":"Low-latency double-precision floating-point division for FPGAs","authors":"B. Liebig, A. Koch","doi":"10.1109/FPT.2014.7082762","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082762","url":null,"abstract":"With growing FPGA capacities, applications requiring more intensive use of floating-point arithmetic become feasible candidates for acceleration using reconfigurable logic. Still among the more uncommon operations, however, are fast double-precision divider units. Since our application domain (acceleration of custom-compiled convex solvers) heavily relies on these blocks, we have implemented low-latency dividers based on the Goldschmidt algorithm that are accurate up to 1 bit of least precision (1-ULP). On Virtex-6 devices, our units operate at 200 MHz and significantly outperform other state-of-the-art 1-ULP dividers. We evaluate our blocks both stand-alone, as well as on the application-level when used for the high-level synthesis of the convex solver cores.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"29 1","pages":"107-114"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81647312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082743
J. Cong
Customized computing has been of interest to the research community for over three decades. The interest has intensified in recent years as power and energy have become a significant limiting factor for the computing industry. For example, the energy consumed by the datacenters of some large internet service providers is well over 10^9 kilowatt-hours. FPGA-based acceleration has shown 10-1000X performance/energy efficiency over general-purpose processors in many applications. However, programming FPGAs as a computing device is still a significant challenge. Most accelerators are designed using manual RTL coding. Recent progress in high-level synthesis (HLS) has improved programming productivity considerably, as one can quickly implement functional blocks written in high-level programming languages such as C or C++ instead of RTL. But in using HLS tools for accelerated computing, the programmer still faces many design decisions, such as the implementation choices of each module and the communication schemes between different modules, and has to implement additional logic for data management, such as memory partitioning, data prefetching and reuse. Extensive source-code rewriting is often required to achieve high-performance acceleration using the existing HLS tools. In this talk, I shall present the ongoing work at UCLA to enable further automation for customized computing. One effort is on automated compilation, combining source-code-level transformations for HLS with efficient generation of parameterized architecture templates. I shall highlight our progress on loop restructuring and code generation, memory partitioning, data prefetching and reuse, and combined module selection, duplication, and scheduling with communication optimization. These techniques allow the programmer to easily compile computation kernels to FPGAs for acceleration. Another direction is to develop efficient runtime support for scheduling and transparent resource management for the integration of FPGAs for datacenter-scale acceleration, which is becoming a reality (for example, Microsoft recently used over 1,600 servers with FPGAs to accelerate their search engine and reported very encouraging results). Our runtime system provides scheduling and resource management support at multiple levels, including the server-node level, job level, and datacenter level, so that programmers can use existing programming interfaces, such as MapReduce or Hadoop, for large-scale distributed computation.
{"title":"Automating customized computing","authors":"J. Cong","doi":"10.1109/FPT.2014.7082743","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082743","url":null,"abstract":"Customized computing has been of interest to the research community for over three decades. The interest has intensified in the recent years as the power and energy become a significant limiting factor to the computing industry. For example, the energy consumed by the datacenters of some large internet service provides is well over 109 Kilowatt-hours. FPGA-based acceleration has shown 10–1000X performance/energy efficiency over the general-purpose processors in many applications. However, programming FPGAs as a computing device is still a significant challenge. Most of accelerators are designed using manual RTL coding. The recent progress in high-level synthesis (HLS) has improved the programming productivity considerably where one can quickly implement functional blocks written using high-level programming languages as C or C++ instead of RTL. But in using the HLS tool for accelerated computing, the programmer still faces a lot of design decisions, such as implementation choices of each module and communication schemes between different modules, and has to implement additional logic for data management, such as memory partitioning, data prefetching and reuse. Extensive source code rewriting is often required to achieve high-performance acceleration using the existing HLS tools. In this talk, I shall present the ongoing work at UCLA to enable further automation for customized computing. One effort is on automated compilation to combining source-code level transformation for HLS with efficient parameterized architecture template generations. I shall highlight our progress on loop restructuring and code generation, memory partitioning, data prefetching and reuse, combined module selection, duplication, and scheduling with communication optimization. These techniques allows the programmer to easily compile computation kernels to FPGAs for acceleration. Another direction is to develop efficient runtime support for scheduling and transparent resource management for integration of FPGAs for datacenter-scale acceleration, which is becoming a reality (for example, Microsoft recently used over 1,600 servers with FPGAs for accelerating their search engine and reported very encouraging results). Our runtime system provides scheduling and resource management support at multiple levels, including server node-level, job-level, and datacenter-level so that programmer can make use the existing programming interfaces, such as MapReduce or Hadoop, for large-scale distributed computation.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87915682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}