Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082796
Brahim Al Farisi, Karel Heyse, D. Stroobandt
A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which only one needs to be realised at any given time. Using dynamic partial reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area large enough to contain the biggest mode. This can save considerable chip area. Conventional dynamic partial reconfiguration techniques generate a configuration for every mode separately. As a result, switching between modes rewrites the complete reconfigurable region, which often leads to long reconfiguration times. In this paper we give an overview of our research on reducing this reconfiguration overhead for multi-mode circuits, in which we explored several joint optimization strategies at different stages of the tool flow.
{"title":"Reducing the overhead of dynamic partial reconfiguration for multi-mode circuits","authors":"Brahim Al Farisi, Karel Heyse, D. Stroobandt","doi":"10.1109/FPT.2014.7082796","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082796","url":null,"abstract":"A multi-mode circuit implements the functionality of a limited number of circuits, called modes, of which at any given time only one needs to be realised. Using dynamic partial reconfiguration of an FPGA, all the modes can be implemented on the same reconfigurable region, requiring only an area that can contain the biggest mode. This can save considerable chip area. Conventional dynamic partial reconfiguration techniques generate a configuration for every mode separately. As a result, to switch between modes the complete reconfigurable region is rewritten, which often leads to long reconfiguration times. In this paper we give an overview of research we conducted to reduce this overhead of dynamic partial reconfiguration for multi-mode circuits. In this research we explored several joint optimization strategies at different stages of the tool flow.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"32 1","pages":"282-283"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78802435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082785
Server Kasap, Soydan Redif
In this paper, we introduce a novel reconfigurable hardware architecture for computing the polynomial matrix multiplication (PMM) of polynomial matrices/vectors. The proposed algorithm exploits an extension of the fast convolution technique to multiple-input, multiple-output (MIMO) systems. The proposed architecture is the first one devoted to the hardware implementation of PMM. Hardware implementation of the algorithm is achieved via a highly pipelined, partly systolic FPGA architecture. We verify the algorithmic accuracy of the architecture, which is scalable in terms of the order of the input matrices, through FPGA-in-the-loop hardware co-simulations. Results are presented to demonstrate the accuracy and capability of the architecture.
{"title":"Novel reconfigurable hardware implementation of polynomial matrix/vector multiplications","authors":"Server Kasap, Soydan Redif","doi":"10.1109/FPT.2014.7082785","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082785","url":null,"abstract":"In this paper, we introduce a novel reconfigurable hardware architecture for computing the polynomial matrix multiplication (PMM) of polynomial matrices/vectors. The proposed algorithm exploits an extension of the fast convolution technique to multiple-input, multiple-output (MIMO) systems. The proposed architecture is the first one devoted to the hardware implementation of PMM. Hardware implementation of the algorithm is achieved via a highly pipelined, partly systolic FPGA architecture. We verify the algorithmic accuracy of the architecture, which is scalable in terms of the order of the input matrices, through FPGA-in-the-loop hardware co-simulations. Results are presented to demonstrate the accuracy and capability of the architecture.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"18 1","pages":"243-247"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83748456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082792
Susumu Mashimo, M. Amagasaki, M. Iida, M. Kuga, T. Sueyoshi
Many Android systems require high performance, because embedded systems running this operating system are used in many fields and rely on increasingly complicated processing. To accommodate this, we present a software/hardware (SW/HW) coprocessing platform implemented on a programmable system-on-chip (Xilinx Zynq). This platform provides a unified architecture, an extended OS kernel, an application framework, and an application distribution model to simplify the development and use of Android SW/HW coprocessing applications.
{"title":"Zyndroid: An Android platform for software/hardware coprocessing","authors":"Susumu Mashimo, M. Amagasaki, M. Iida, M. Kuga, T. Sueyoshi","doi":"10.1109/FPT.2014.7082792","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082792","url":null,"abstract":"High performance is required of many Android systems because embedded systems written for this operating system are used in several fields and rely on increasingly complicated processing. To accommodate this, we present a software/hardware (SW/HW) coprocessing platform implemented on a programmable system-on-a-chip (Xilinx Inc.: Zynq). This platform provides a unified architecture, extended OS kernel, application framework, and application distribution model to simplify the development and use of Android SW/HW coprocessing applications.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"393 1","pages":"272-275"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73016197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082818
Yu Fujita, K. Masuyama, H. Amano
Cool Mega Array with Silicon On Thin BOX (CMA-SOTB) is an extremely low-power coarse-grained reconfigurable accelerator. It was implemented using SOTB technology developed by the Japanese national project LEAP (Low-power Electronics Association & Project). Making the best use of this device technology and low-energy architectural techniques, CMA-SOTB runs at a clock of more than 25 MHz with a supply voltage of less than 0.3 V. Various kinds of optimization can be performed by controlling the body bias voltages of the PE array and the micro-controller independently. The demonstration using CMA-SOTB first shows that a simple image processing application can run from a 0.25V-0.4V solar battery. Then, leakage power control by changing the body bias is demonstrated. In stand-by mode, less than 20 μW is consumed by applying a strong reverse bias.
{"title":"Image processing by A 0.3V 2MW coarse-grained reconfigurable accelerator CMA-SOTB with a solar battery","authors":"Yu Fujita, K. Masuyama, H. Amano","doi":"10.1109/FPT.2014.7082818","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082818","url":null,"abstract":"Cool mega array with silicon on thin box (CMA-SOTB) is an extremely low power coarse grained reconfigurable accelerator. It was implemented by using the SOTB technology developed by a Japanese national project, low-power electronics association & project (LEAP). Making the best use of such a device and low energy architectural techniques, CMA-SOTB works more than 25MHz clock with less than 0.3V supply voltage. Various kind of optimization can be done by controlling the body bias voltage for PE array and micro-controller independently. The demonstration using CMA-SOTB first shows that a simple image processing application can work with a 0.25V-0.4V solar battery. Then the leakage power control by changing the body bias is demonstrated. In the stand-by mode, less than 20μW power is consumed by using strong reverse bias.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"42 1","pages":"354-357"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74458699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082821
Hossein Borhanifar, Seyed Peyman Zolnouri
In this paper, a solution for the Blokus Duo game is presented using the minimax algorithm. Alpha-beta pruning is then applied to the algorithm so that its running speed increases significantly, reducing playing time. Moreover, the Pentobi® software is used as a benchmark opponent, and the final results against it are reported. All code is written directly for hardware in the VHDL language. After synthesis with Quartus II®, the design is implemented on the DE1-SoC board, which uses a Cyclone V FPGA.
{"title":"Optimize MinMax algorithm to solve Blokus Duo game by HDL","authors":"Hossein Borhanifar, Seyed Peyman Zolnouri","doi":"10.1109/FPT.2014.7082821","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082821","url":null,"abstract":"In this paper, a solution for Blokus Duo game is presented using minmax algorithm. Then, Alpha-beta pruning method is implemented on the algorithm to reduce playing time in the manner that its running speed increases significantly. Moreover, Pentobi® software as a criterion is used as a competitor and the final results for that are reported. All codes are directly written on a hardware basis using VHDL language. After being synthesized by Quartus II®, the result is implemented on DE1-SOC board which uses cyclone V FPGA.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"322 1","pages":"362-365"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76462085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082817
Jiahua Chen, Tao Wang, Haoyang Wu, Jian Gong, Xiaoguang Li, Yang Hu, Gaohan Zhang, Zhiwei Li, Junrui Yang, Songwu Lu
The ongoing mobile Internet revolution calls for quick adoption of new wireless communication and networking technologies. To enable such fast innovation, a software-defined platform is needed to validate and refine new algorithms, protocols, and architectures in communications and networking. Unfortunately, no current systems can meet both requirements of high programmability and high performance. In this work, we report our recent effort on building such a reconfigurable platform. We show that our proposed platform, GRT, can support both high performance and high programmability in a unified framework. Moreover, GRT is seamlessly integrated into the standard TCP/IP network protocol stack under Linux, and can act as a WiFi-capable network interface card. Furthermore, it ensures backward compatibility with the popular GNU Radio platform, a user-friendly yet low-performance system. In the demo, we will demonstrate the full functionalities of 802.11a/g WiFi on GRT, including (1) wireless file transfer between two GRT systems at a speed of tens of Mbps; (2) execution of default Linux TCP/IP applications without changes (e.g. SSH); (3) access point (AP) operation mode, where commodity WiFi devices access the Internet via the GRT-converted AP over the WiFi channel.
{"title":"A high-performance and high-programmability reconfigurable wireless development platform","authors":"Jiahua Chen, Tao Wang, Haoyang Wu, Jian Gong, Xiaoguang Li, Yang Hu, Gaohan Zhang, Zhiwei Li, Junrui Yang, Songwu Lu","doi":"10.1109/FPT.2014.7082817","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082817","url":null,"abstract":"The ongoing mobile Internet revolution calls for quick adoptions of new wireless communication and networking technologies. To enable such fast innovations, a software-defined platform is needed to validate and refine new algorithms, protocols, and architectures in communications and networking. Unfortunately, no current systems can meet both requirements of high programmability and high performance. In this work, we report our recent effort on building such a reconfigurable platform. We show that our proposed platform, GRT, can support both high-performance and high-programmability in a unified framework. Moreover, GRT is seamlessly integrated into the standard TCP/IP network protocol stack under Linux, and can act as a WiFi-capable, network interface card. Furthermore, it ensures backward compatibility with the popular GNU Radio platform, a user-friendly, yet low-performance system. In the demo, we will demonstrate the full functionalities of the 802.11a/g WiFi on GRT, including (1) wireless file transfer between two GRT systems at the speed of tens of Mbps; (2) execution of default Linux TCP/IP applications without changes (e.g. SSH); (3) access point (AP) operation mode, where commodity WiFi devices access the Internet via the GRT-converted AP over the WiFi channel.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"9 1","pages":"350-353"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78628658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082804
Niranjan S. Kulkarni, Jinghua Yang, S. Vrudhula
Threshold-logic gates have long been known to result in more compact and faster circuits than their conventional AND/OR logic equivalents [1]. However, threshold-logic-based design has not entered mainstream design technology (neither custom ASIC nor FPGA) due to the lack of efficient and reliable gate implementations and of the necessary infrastructure for automated synthesis and physical design. This paper is a step toward addressing this gap. We present the architecture of a novel programmable logic array, referred to as a Field Programmable Threshold-Logic Array (FPTLA), in which the basic cells are differential-mode threshold-logic gates (DTGs). Each DTG cell is a clock-edge-triggered circuit that computes a threshold-logic function. A DTG can be programmed to implement different threshold-logic functions by routing appropriate signals to its inputs. This reduces the number of SRAM cells inside the logic blocks by about 60% compared to conventional CLBs, without adding any significant overhead to the routing infrastructure. Since a DTG is essentially a multi-input, edge-triggered flip-flop that computes a threshold function, a network of DTGs forms a nano-pipelined circuit. The advantages of such a network are demonstrated on a set of deeply pipelined datapath circuits implemented on FPTLAs and conventional FPGAs using the well-established FPGA design frameworks VTR (Verilog To Routing) and VPR (Versatile Place and Route) [2]. The results indicate that an FPTLA can achieve up to a 2X improvement in delay for nearly the same energy and logic area as a conventional LUT-based FPGA. Although differential-mode circuits can potentially be more sensitive to process variations, FPTLAs can be made robust to such variations without sacrificing their improved energy efficiency and performance over FPGAs.
{"title":"A fast, energy efficient, field programmable threshold-logic array","authors":"Niranjan S. Kulkarni, Jinghua Yang, S. Vrudhula","doi":"10.1109/FPT.2014.7082804","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082804","url":null,"abstract":"Threshold-logic gates have long been known to result in more compact and faster circuits when compared to conventional AND/OR logic equivalents [1], However, threshold logic based design has not entered the mainstream design technology (neither custom ASIC nor FPGA) due to the lack of efficient and reliable gate implementations and the necessary infrastructure for automated synthesis and physical design. This paper is a step toward addressing this gap. We present the architecture of a novel programmable logic array, referred to as Field Programmable Threshold-Logic Array (FPTLA), in which the basic cells are differential mode threshold-logic gates (DTGs). Each individual DTG cell is a clock edge-triggered circuit that computes a threshold-logic function. A DTG can be programmed to implement different threshold logic functions by routing appropriate signals to their inputs. This reduces the number of SRAMs inside the logic blocks by about 60% compared to conventional CLBs, without adding any significant overhead in the routing infrastructure. Since a DTG is essentially a multi-input, edge-triggered flipflop that computes a threshold function, a network of DTGs forms a nano-pipelined circuit. The advantages of such a network are demonstrated on a set of deeply pipelined datapath circuits implemented on FPTLAs and conventional FPGAs using the well established FPGA design framework VTR (Verilog To Routing) and VPR (Versatile Place and Route) [2]. The results indicate that an FPTLA can achieve up to 2X improvement in delay for nearly the same energy and logic area compared to the conventional LUT based FPGA. Although differential mode circuits can potentially be more sensitive to process variations, FPTLAs can be made robust to such variations without sacrificing their improved energy efficiency and performance over FPGAs.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"12 1","pages":"300-305"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77196656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082816
Ma Ning, Wang Shaojun, Pang Yeyong, Peng Yu
In recent years, implementing complicated algorithms in embedded systems, especially in heterogeneous computing systems, has gained more and more attention in many fields. The problem is that such an implementation requires a large amount of coding and debugging work, even if the algorithm has already been verified in a high-level language in a PC environment. Our demo presents a method that reduces the time needed to develop an algorithm for an embedded, heterogeneous system by using high-level synthesis (HLS). The least-squares support vector machine (LS-SVM) algorithm was realized on the Zynq platform by translating a high-level language into a hardware description language (HDL). Based on the features of the developed heterogeneous system and the theory of LS-SVM, three parts were implemented: a kernel-matrix generation module, a linear-equation solving module, and a forecasting module. The first and third parts are placed on the ARM processor and written in C. Moreover, since the second part is compute-intensive, it is realized in logic resources using a high-level language. To manage data communication and computing tasks, an SoPC system was designed on the Zynq platform, operating in a PXI chassis. Experiments demonstrate that the design method is feasible and can be used to implement other complicated algorithms. The computational precision and time consumption are given at the end.
{"title":"Implementation of LS-SVM with HLS on Zynq","authors":"Ma Ning, Wang Shaojun, Pang Yeyong, Peng Yu","doi":"10.1109/FPT.2014.7082816","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082816","url":null,"abstract":"In recent years, implementing a complicated algorithm in an embedded system, especially in a heterogeneous computing system, has gained more and more attention in many fields. The problem is that the implementation needs amounts of coding and debugging work, even if the algorithm has been verified by high-level language in PC environment. Our demo presents a method which can reduce the time of developing an algorithm in an embedded and heterogeneous system by high level synthesis method. Least Square Support Vector Machine(LS-SVM) algorithm was realized on Zynq platform by translating high-level language to Hardware Description Language(HDL). Basing on the feature of the developed heterogeneous system and the theory of LS-SVM, three parts were implemented to realize LS-SVM which includes a generating Kernel Matrix module, a solving linear equations module and a forecasting module. The first and the third parts have been placed in ARM processor by C language. Moreover, considering that the second parts was compute-intensive, it has been realized in logic resource by using high-level language. To manage data communication and computing task, an SOPC system has been designed on Zynq platform which worked in PXI chassis. Experiments demonstrate that the design method is feasible and can be used for the implementation of other complicate algorithm. The precision and time consumption in computing are given at the end.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"81 1","pages":"346-349"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76733763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082762
B. Liebig, A. Koch
With growing FPGA capacities, applications requiring more intensive use of floating-point arithmetic become feasible candidates for acceleration using reconfigurable logic. Still among the more uncommon operations, however, are fast double-precision divider units. Since our application domain (acceleration of custom-compiled convex solvers) relies heavily on these blocks, we have implemented low-latency dividers based on the Goldschmidt algorithm that are accurate to within one unit of least precision (1 ULP). On Virtex-6 devices, our units operate at 200 MHz and significantly outperform other state-of-the-art 1-ULP dividers. We evaluate our blocks both stand-alone and at the application level, when used for the high-level synthesis of the convex solver cores.
{"title":"Low-latency double-precision floating-point division for FPGAs","authors":"B. Liebig, A. Koch","doi":"10.1109/FPT.2014.7082762","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082762","url":null,"abstract":"With growing FPGA capacities, applications requiring more intensive use of floating-point arithmetic become feasible candidates for acceleration using reconfigurable logic. Still among the more uncommon operations, however, are fast double-precision divider units. Since our application domain (acceleration of custom-compiled convex solvers) heavily relies on these blocks, we have implemented low-latency dividers based on the Goldschmidt algorithm that are accurate up to 1 bit of least precision (1-ULP). On Virtex-6 devices, our units operate at 200 MHz and significantly outperform other state-of-the-art 1-ULP dividers. We evaluate our blocks both stand-alone, as well as on the application-level when used for the high-level synthesis of the convex solver cores.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"29 1","pages":"107-114"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81647312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2014-12-01 | DOI: 10.1109/FPT.2014.7082743
J. Cong
Customized computing has been of interest to the research community for over three decades. The interest has intensified in recent years as power and energy have become a significant limiting factor for the computing industry. For example, the energy consumed by the datacenters of some large internet service providers is well over 10^9 kilowatt-hours. FPGA-based acceleration has shown 10-1000X performance/energy efficiency over general-purpose processors in many applications. However, programming FPGAs as a computing device is still a significant challenge. Most accelerators are designed using manual RTL coding. Recent progress in high-level synthesis (HLS) has improved programming productivity considerably, as one can quickly implement functional blocks written in high-level programming languages such as C or C++ instead of RTL. But in using HLS tools for accelerated computing, the programmer still faces many design decisions, such as the implementation choices of each module and the communication schemes between different modules, and has to implement additional logic for data management, such as memory partitioning, data prefetching and reuse. Extensive source-code rewriting is often required to achieve high-performance acceleration using the existing HLS tools. In this talk, I shall present the ongoing work at UCLA to enable further automation for customized computing. One effort is on automated compilation, combining source-code-level transformations for HLS with efficient generation of parameterized architecture templates. I shall highlight our progress on loop restructuring and code generation, memory partitioning, data prefetching and reuse, and combined module selection, duplication, and scheduling with communication optimization. These techniques allow the programmer to easily compile computation kernels to FPGAs for acceleration. Another direction is to develop efficient runtime support for scheduling and transparent resource management for the integration of FPGAs for datacenter-scale acceleration, which is becoming a reality (for example, Microsoft recently used over 1,600 servers with FPGAs to accelerate their search engine and reported very encouraging results). Our runtime system provides scheduling and resource management support at multiple levels, including the server-node level, job level, and datacenter level, so that programmers can use existing programming interfaces, such as MapReduce or Hadoop, for large-scale distributed computation.
{"title":"Automating customized computing","authors":"J. Cong","doi":"10.1109/FPT.2014.7082743","DOIUrl":"https://doi.org/10.1109/FPT.2014.7082743","url":null,"abstract":"Customized computing has been of interest to the research community for over three decades. The interest has intensified in the recent years as the power and energy become a significant limiting factor to the computing industry. For example, the energy consumed by the datacenters of some large internet service provides is well over 109 Kilowatt-hours. FPGA-based acceleration has shown 10–1000X performance/energy efficiency over the general-purpose processors in many applications. However, programming FPGAs as a computing device is still a significant challenge. Most of accelerators are designed using manual RTL coding. The recent progress in high-level synthesis (HLS) has improved the programming productivity considerably where one can quickly implement functional blocks written using high-level programming languages as C or C++ instead of RTL. But in using the HLS tool for accelerated computing, the programmer still faces a lot of design decisions, such as implementation choices of each module and communication schemes between different modules, and has to implement additional logic for data management, such as memory partitioning, data prefetching and reuse. Extensive source code rewriting is often required to achieve high-performance acceleration using the existing HLS tools. In this talk, I shall present the ongoing work at UCLA to enable further automation for customized computing. One effort is on automated compilation to combining source-code level transformation for HLS with efficient parameterized architecture template generations. I shall highlight our progress on loop restructuring and code generation, memory partitioning, data prefetching and reuse, combined module selection, duplication, and scheduling with communication optimization. These techniques allows the programmer to easily compile computation kernels to FPGAs for acceleration. Another direction is to develop efficient runtime support for scheduling and transparent resource management for integration of FPGAs for datacenter-scale acceleration, which is becoming a reality (for example, Microsoft recently used over 1,600 servers with FPGAs for accelerating their search engine and reported very encouraging results). Our runtime system provides scheduling and resource management support at multiple levels, including server node-level, job-level, and datacenter-level so that programmer can make use the existing programming interfaces, such as MapReduce or Hadoop, for large-scale distributed computation.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"17 1","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87915682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}