
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

Enabling Hardware Exploration in Software-Defined Networking: A Flexible, Portable OpenFlow Switch
Asif Khan, Nirav H. Dave
The OpenFlow framework allows the data plane of a network switch to be managed by a software-based controller. This enables a software-defined networking model in which sophisticated network management policies can be deployed. In this paper, we present an FPGA-based switch which is fully compliant with OpenFlow 1.0 and meets the 10 Gbps line rate. The switch design is both modular and highly parametrized. It has generic split-transaction interfaces and isolated platform-specific features, making it both flexible for architectural exploration and portable across FPGA platforms. The flow tables in the switch can be implemented on Block RAM or DRAM without any modifications to the rest of the design. The switch has been ported to the NetFPGA-10G, the ML605 and the DE4 boards. It can be integrated with a desktop PC via either the PCIe or the serial link, and with an FPGA-based MIPS64 softcore as a coprocessor. The latter FPGA-based switch-processor system provides an ideal platform for network research in which both the data plane and the control plane can be explored.
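To make the flow-matching behaviour concrete, below is a minimal software sketch of OpenFlow 1.0-style lookup: each flow entry constrains some header fields (omitted fields act as wildcards) and the highest-priority matching entry supplies the action. The class names, field names and table-miss policy are illustrative assumptions; the paper's hardware flow tables are not described at this level of detail in the abstract.

```python
# Minimal sketch of OpenFlow 1.0-style flow matching (illustrative only;
# field names and classes are assumptions, not the paper's implementation).

class FlowEntry:
    def __init__(self, match, priority, action):
        self.match = match        # dict of header field -> required value; missing field = wildcard
        self.priority = priority
        self.action = action      # e.g. ("output", port) or ("drop",)

class FlowTable:
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)
        self.entries.sort(key=lambda e: e.priority, reverse=True)

    def lookup(self, packet_headers):
        # Return the action of the highest-priority entry whose fields all match.
        for entry in self.entries:
            if all(packet_headers.get(f) == v for f, v in entry.match.items()):
                return entry.action
        return ("drop",)  # assumed table-miss policy; OpenFlow 1.0 would forward to the controller

# Usage example
table = FlowTable()
table.add(FlowEntry({"ip_dst": "10.0.0.2", "tcp_dst": 80}, priority=10, action=("output", 3)))
table.add(FlowEntry({"ip_dst": "10.0.0.2"}, priority=5, action=("output", 1)))
print(table.lookup({"ip_dst": "10.0.0.2", "tcp_dst": 80, "ip_src": "10.0.0.1"}))  # ("output", 3)
```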
DOI: 10.1109/FCCM.2013.15 (published 2013-04-28)
Citations: 30
Exploring Manycore Multinode Systems for Irregular Applications with FPGA Prototyping
Marco Ceriani, G. Palermo, Simone Secchi, Antonino Tumeo, Oreste Villa
We propose an intermediate approach between full custom hardware systems and full-software tools. Figure 1 shows an overview of the proposed architecture. We start from an off-the-shelf architecture composed of simple, in-order cores and an on-chip interconnection. The on-chip interconnection interfaces the processing cores with the memory controller for the external memory (DDR3) and the shared I/O peripherals. We add three custom components: the Global Memory Access Scheduler (GMAS), the Global Network Interface (GNI) and the Global SYNChronization module (GSYNC). The GMAS enables support for the scrambled address space. It also implements part of the support for latency tolerance by storing remote memory operations, and acts as a scheduler for lightweight software multithreading.
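The abstract refers to a scrambled address space handled by the GMAS but does not give the scrambling function. As a rough illustration of the idea, the sketch below XOR-folds upper address bits into the node-selection bits so that contiguous addresses spread across nodes; the bit widths and the fold itself are assumptions, not the paper's design.

```python
# Illustrative address scrambling: spread contiguous global addresses across
# nodes by XOR-folding upper address bits into the node index. The exact
# function used by the GMAS is not specified in the abstract; this is a
# generic sketch under assumed bit widths.

NODE_BITS = 4          # 16 nodes (assumption)
BLOCK_BITS = 6         # 64-byte blocks (assumption)

def scramble(addr):
    block = addr >> BLOCK_BITS
    node = (block ^ (block >> NODE_BITS) ^ (block >> (2 * NODE_BITS))) & ((1 << NODE_BITS) - 1)
    local = block >> NODE_BITS                    # block index within the selected node
    offset = addr & ((1 << BLOCK_BITS) - 1)
    return node, (local << BLOCK_BITS) | offset

# Consecutive blocks land on different nodes, smoothing out hot spots.
for a in range(0, 5 * 64, 64):
    print(hex(a), "->", scramble(a))
```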
DOI: 10.1109/FCCM.2013.62 (published 2013-04-28)
Citations: 5
Accelerating Join Operation for Relational Databases with FPGAs
R. Halstead, Bharat Sukhwani, Hong Min, Mathew S. Thoennes, Parijat Dube, S. Asaad, B. Iyer
In this paper, we investigate the use of field programmable gate arrays (FPGAs) to accelerate relational joins. Relational join is one of the most CPU-intensive, yet commonly used, database operations. Hashing can be used to reduce the time complexity from quadratic (naïve) to linear time. However, doing so can introduce false positives to the results which must be resolved. We present a hash-join engine on FPGA that performs hashing, conflict resolution, and joining on a PCIe-attached system, achieving greater than 11x speedup over software.
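The point about hashing introducing false positives that must be resolved can be shown in a few lines: bucketing rows by the hash of the join key gives linear-time candidate generation, but colliding keys share a bucket, so every candidate pair is re-checked against the actual key values. This is a plain-software sketch of a textbook hash join, not the paper's FPGA engine.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, key):
    """Equi-join two lists of dicts on `key` using a hash table.

    Bucketing by hash(key) gives linear-time candidate generation; the
    explicit equality check below resolves hash-collision false positives.
    """
    buckets = defaultdict(list)
    for r in build_rows:
        buckets[hash(r[key])].append(r)           # build phase

    result = []
    for s in probe_rows:                          # probe phase
        for r in buckets.get(hash(s[key]), []):
            if r[key] == s[key]:                  # resolve false positives
                result.append({**r, **s})
    return result

orders = [{"cust": 1, "item": "disk"}, {"cust": 2, "item": "cpu"}]
custs  = [{"cust": 1, "name": "Ada"}, {"cust": 3, "name": "Bob"}]
print(hash_join(custs, orders, "cust"))   # only cust 1 appears in both tables
```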
DOI: 10.1109/FCCM.2013.17 (published 2013-04-28)
Citations: 64
A Configurable Architecture for a Visual Saliency System and Its Application in Retail
Nandhini Chandramoorthy, Siddharth Advani, K. Irick, N. Vijaykrishnan
Summary form only given. The objective of this paper is to present a configurable architecture for a visual saliency model based on AIM. It presents algorithmic enhancements to AIM that facilitate the design of a performance-efficient hardware architecture offering tradeoffs between accuracy, resource utilization and latency. The AIM computational model involves (1) extraction of a set of coefficient features for each local patch in an image, (2) estimation of the probability density for each coefficient with respect to its local surround, (3) computation of their product to give a joint likelihood, and (4) computation of the self-information of each pixel from its log likelihood. Calculating the likelihood with respect to each pixel individually in a local surround is computationally expensive. The paper proposes to approximate the contribution of pixels in the surround in terms of “cells” grouped further into “support zones”, whose widths are configurable. This approximation leads to nearly a 10x reduction in the number of multipliers, a critical resource, for a 41x41 surround size.
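The four steps listed in the abstract (coefficient extraction, density estimation, joint likelihood as a product, self-information from the log likelihood) can be sketched numerically. The histogram-based density estimate below uses the whole image as the surround for simplicity; the bin count, window handling and test data are assumptions, not the AIM parameters or the paper's cell/support-zone approximation.

```python
import numpy as np

def self_information_map(coeffs, bins=32):
    """coeffs: array of shape (H, W, K), holding K coefficient features per pixel.

    Estimates each coefficient's density over the whole image (a simple
    stand-in for the local surround), forms the joint likelihood as a
    product across coefficients, and returns the negative log likelihood
    per pixel (its self-information).
    """
    H, W, K = coeffs.shape
    info = np.zeros((H, W))
    for k in range(K):
        c = coeffs[:, :, k]
        hist, edges = np.histogram(c, bins=bins, density=True)
        widths = np.diff(edges)
        idx = np.clip(np.digitize(c, edges) - 1, 0, bins - 1)
        p = hist[idx] * widths[idx] + 1e-12       # per-pixel probability mass
        info += -np.log(p)                        # sum of logs equals log of the product
    return info

rng = np.random.default_rng(0)
coeffs = rng.normal(size=(8, 8, 3))
coeffs[4, 4, :] += 6.0                            # an outlier patch
saliency = self_information_map(coeffs)
print(saliency[4, 4] > saliency.mean())           # rare patch yields high self-information
```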
DOI: 10.1109/FCCM.2013.41 (published 2013-04-28)
Citations: 1
Accuracy-Performance Tradeoffs on an FPGA through Overclocking
Kan Shi, D. Boland, G. Constantinides
Embedded applications can often demand stringent latency requirements. While high degrees of parallelism within custom FPGA-based accelerators may help to some extent, it may also be necessary to limit the precision used in the datapath to boost the operating frequency of the implementation. However, by reducing the precision, the engineer introduces quantization error into the design. In this paper, we demonstrate that for many applications it would be preferable to simply overclock the design and accept that timing violations may arise. Since the errors introduced by timing violations occur rarely, they will cause less noise than quantization errors. Through the use of analytical models and empirical results on a Xilinx Virtex-6 FPGA, we show that a geometric mean reduction of 67.9% to 98.8% in error expectation or a geometric mean improvement of 3.1% to 27.6% in operating frequency can be obtained using this alternative design methodology.
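The tradeoff the abstract describes, a deterministic quantization error from a truncated datapath versus rare but large errors from timing violations, can be illustrated with expected-value arithmetic. The numbers below (bits dropped, violation probability, error magnitude) are made up for illustration and are not the paper's measurements or its analytical model.

```python
def expected_quant_error(bits_dropped):
    # Mean absolute error from truncating `bits_dropped` low-order bits,
    # assuming uniformly distributed operands.
    return (2 ** bits_dropped - 1) / 2

def expected_timing_error(violation_prob, error_magnitude):
    # Timing violations are rare but large: expectation = probability * magnitude.
    return violation_prob * error_magnitude

# Hypothetical numbers: dropping 4 LSBs to meet timing, versus keeping full
# precision and overclocking with a 1-in-10^5 chance of corrupting a value
# by 2^15 on the longest carry path.
print(expected_quant_error(4))                     # 7.5
print(expected_timing_error(1e-5, 2 ** 15))        # ~0.33, lower because violations are rare
```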
DOI: 10.1109/FCCM.2013.10 (published 2013-04-28)
Citations: 24
A Delay-based PUF Design Using Multiplexers on FPGA
Miaoqing Huang, Shiming Li
Summary form only given. Physically unclonable functions (PUFs) have been a hot research topic in hardware-oriented security for many years. Given a challenge as an input to the PUF, it generates a corresponding response, which can be treated as a unique fingerprint or signature for authentication purposes. In this paper, a delay-based PUF design involving multiplexers on FPGA is presented. Due to the intrinsic difference in the switching latencies of two chained multiplexers, a positive pulse may be produced at the output of the downstream multiplexer. This pulse can be used to set the output of a D flip-flop to `1'. The proposed design improves the randomness of the outputs of the PUF.
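A behavioural sketch of the response mechanism described above: per-device variation in multiplexer switching delays determines whether a pulse appears and hence whether the flip-flop latches a '1'. The Gaussian delay model, pulse threshold and challenge encoding are assumptions made for illustration; real behaviour depends on the FPGA's placement, routing and process variation.

```python
import random

def make_puf(n_stages, seed):
    """Per-device random mux switching delays stand in for process variation.
    Each stage has two candidate delays; a challenge bit picks one per stage."""
    rng = random.Random(seed)
    return [(rng.gauss(100, 5), rng.gauss(100, 5)) for _ in range(n_stages)]

def response(puf, challenge, pulse_threshold=1.0):
    """One response bit per pair of chained stages: a '1' is latched when the
    delay difference between the two muxes is large enough to produce a
    detectable positive pulse at the downstream mux output."""
    bits = []
    for i in range(0, len(puf) - 1, 2):
        d_up = puf[i][challenge[i]]
        d_down = puf[i + 1][challenge[i + 1]]
        bits.append(1 if d_up - d_down > pulse_threshold else 0)
    return bits

challenge = [0, 1, 1, 0, 1, 0, 0, 1]
print(response(make_puf(8, seed=42), challenge))   # device A
print(response(make_puf(8, seed=43), challenge))   # a different device typically differs
```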
DOI: 10.1109/FCCM.2013.11 (published 2013-04-28)
Citations: 6
Efficient Large Integer Squarers on FPGA
Simin Xu, Suhaib A. Fahmy, I. Mcloughlin
This paper presents an optimised high throughput architecture for integer squaring on FPGAs. The approach reduces the number of DSP blocks required compared to a standard multiplier. Previous work has proposed the tiling method for double precision squaring, using the fewest DSP blocks so far. However, that approach incurs a large overhead in terms of look-up table (LUT) consumption and has a complex and irregular structure that is not suitable for larger word sizes. The architecture proposed in this paper can reduce DSP block usage by an amount equivalent to the tiling method while incurring a much lower LUT overhead: 21.8% fewer LUTs for a 53-bit squarer. The architecture is mapped to a Xilinx Virtex 6 FPGA and evaluated for a wide range of operand word sizes, demonstrating its scalability and efficiency.
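The multiplier saving for squaring comes from symmetry: when the operand is split into words, the cross-products a_i*a_j and a_j*a_i are identical, so each needs computing only once and doubling. Below is a small sketch that squares by words and counts the multiplications it issues; the 17-bit word width roughly matches a Virtex-6 DSP operand, but the decomposition shown is a generic one, not the paper's exact tiling.

```python
W = 17   # approximate unsigned operand width of a Virtex-6 DSP multiplier (assumption)

def square_by_words(x, word_bits=W):
    """Square x by splitting it into `word_bits`-wide words and summing
    partial products; symmetric cross terms are computed once and doubled."""
    words, shift = [], 0
    while x >> shift:
        words.append((x >> shift) & ((1 << word_bits) - 1))
        shift += word_bits
    acc, multiplies = 0, 0
    for i, a in enumerate(words):
        acc += (a * a) << (2 * i * word_bits)                  # diagonal terms
        multiplies += 1
        for j in range(i + 1, len(words)):                     # each cross term computed once
            acc += (2 * a * words[j]) << ((i + j) * word_bits)
            multiplies += 1
    return acc, multiplies

x = (1 << 53) - 12345                   # a 53-bit operand, matching the paper's example size
sq, n = square_by_words(x)
assert sq == x * x
print(n, "word multiplications")        # fewer than len(words)**2 for a generic multiplier
```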
DOI: 10.1109/FCCM.2013.35 (published 2013-04-28)
Citations: 4
Application Composition and Communication Optimization in Iterative Solvers Using FPGAs
A. Rafique, Nachiket Kapre, G. Constantinides
We consider the problem of minimizing communication with off-chip memory and the composition of multiple linear algebra kernels in iterative solvers for large-scale eigenvalue problems and linear systems of equations. While GPUs may offer higher throughput for individual kernels, overall application performance is limited by the inability to support on-chip sharing of data across kernels. In this paper, we show that higher on-chip memory capacity and superior on-chip communication bandwidth enable FPGAs to better support the composition of a sequence of kernels within these iterative solvers. We present a time-multiplexed FPGA architecture which exploits the on-chip capacity to store dependencies between kernels and the high communication bandwidth to move data. We propose a resource-constrained framework to select the optimal value of an algorithmic parameter which provides the tradeoff between communication and computation cost for a particular FPGA. Using the Lanczos Method as a case study, we show how to minimize communication on FPGAs by this tight algorithm-architecture interaction and obtain superior performance over a GPU, despite its ~5x larger off-chip memory bandwidth and ~2x greater peak single-precision floating-point performance.
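To see why kernel composition matters here, the sketch below runs the textbook Lanczos iteration: each step chains a matrix-vector product, dot products and vector updates, and the vectors produced by one kernel feed the next immediately, which is exactly the inter-kernel data the paper keeps on chip. Plain numpy with a toy diagonal matrix; this is not the paper's time-multiplexed architecture or its resource-constrained parameter selection.

```python
import numpy as np

def lanczos(A, k, seed=0):
    """k steps of the Lanczos iteration; returns the tridiagonal coefficients.

    Each iteration composes several kernels (SpMV, dot products, AXPYs) whose
    intermediate vectors are consumed immediately by the next kernel; this is
    the cross-kernel traffic the paper keeps in on-chip memory.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q_prev = np.zeros(n)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(k):
        w = A @ q - beta * q_prev          # SpMV plus AXPY
        alpha = q @ w                      # dot product
        w -= alpha * q                     # AXPY
        beta = np.linalg.norm(w)           # reduction
        alphas.append(alpha); betas.append(beta)
        q_prev, q = q, w / beta
    return np.array(alphas), np.array(betas)

A = np.diag(np.arange(1.0, 101.0))          # toy symmetric matrix with eigenvalues 1..100
a, b = lanczos(A, k=30)
T = np.diag(a) + np.diag(b[:-1], 1) + np.diag(b[:-1], -1)
print(np.linalg.eigvalsh(T).max())          # approximates A's largest eigenvalue (100)
```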
DOI: 10.1109/FCCM.2013.16 (published 2013-04-28)
Citations: 5
Safe Overclocking of Tightly Coupled CGRAs and Processor Arrays using Razor
Alexander Brant, Ameer Abdelhadi, Douglas H. H. Sim, S. Tang, Michael Xi Yue, G. Lemieux
Overclocking a CPU is a common practice among home-built PC enthusiasts where the CPU is operated at a higher frequency than its speed rating. This practice is unsafe because timing errors cannot be detected by modern CPUs and they can be practically undetectable by the end user. Using a timing speculation technique such as Razor, it is possible to detect timing errors in CPUs. To date, Razor has been shown to correct only unidirectional, feed-forward processor pipelines. In this paper, we safely overclock 2D arrays by extending Razor correction to cover bidirectional communication in a tightly coupled or lockstep fashion. To recover from an error, stall wavefronts are produced which propagate across the device. Multiple errors may arise in close proximity in time and space; if the corresponding stall wavefronts collide, they merge to produce a single unified wavefront, allowing recovery from multiple errors with one stall cycle. We demonstrate the correctness and viability of our approach by constructing a proof-of-concept prototype which runs on a traditional Altera FPGA. Our approach can be applied to custom computing arrays, systolic arrays, CGRAs, and also time-multiplexed FPGAs such as those produced by Tabula. As a result, these devices can be overclocked and safely tolerate dynamic, data-dependent timing errors. Alternatively, instead of overclocking, this same technique can be used to `undervolt' the power supply and save energy.
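A toy model of the stall-wavefront idea: an error launches a stall that spreads one hop per cycle across the array, and where fronts from different errors coincide they count as a single merged stall. The grid automaton below only illustrates that merging; the actual handshake between Razor registers and neighbouring cells is not specified at this level in the abstract.

```python
def stall_map(grid_size, errors, cycles):
    """For each cycle, the set of cells stalled by propagating wavefronts.

    An error at (c, x, y) launches a wavefront that reaches cells at Manhattan
    distance d on cycle c + d. Cells reached by several wavefronts in the same
    cycle stall once (the fronts have merged), not once per error.
    """
    per_cycle = []
    for t in range(cycles):
        stalled = {
            (x, y)
            for x in range(grid_size)
            for y in range(grid_size)
            for c, ex, ey in errors
            if t == c + abs(x - ex) + abs(y - ey)
        }
        per_cycle.append(stalled)
    return per_cycle

# Two errors close in space: where their fronts coincide, the cell appears
# once in that cycle's stalled set, i.e. a single merged wavefront.
for t, cells in enumerate(stall_map(5, errors=[(0, 1, 1), (0, 3, 3)], cycles=6)):
    print(t, sorted(cells))
```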
DOI: 10.1109/FCCM.2013.63 (published 2013-04-28)
Citations: 13
Acceleration of SQL Restrictions and Aggregations through FPGA-Based Dynamic Partial Reconfiguration
C. Dennl, Daniel Ziener, J. Teich
SQL query processing on large database systems is recognized as one of the most important emerging disciplines of computing today. However, current approaches do not provide substantial coverage of typical query operators in hardware. In this paper, we take an important step toward higher operator coverage by proposing a) fully dynamic data path generation that also supports complex operators such as restrictions and aggregations. b) We also analyze the computation times of real database queries running on a normal desktop computer, showing that c) speedups ranging between 4 and 50 are obtainable by providing generative support for the important restrict and aggregate operators using FPGAs.
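The two operator classes targeted here, restrictions (row filters) and aggregations, map naturally onto a streaming pipeline in which rows flow through a predicate stage and survivors update running per-group aggregates; that is the shape a dynamically generated datapath would take. The sketch below is a plain-Python illustration of that pipeline, not the authors' partial-reconfiguration flow.

```python
def stream_query(rows, predicate, group_key, agg_init, agg_update):
    """Restriction followed by aggregation over a stream of row dicts:
    rows failing `predicate` are dropped; the rest update a running
    aggregate per group, in one pass with no intermediate materialisation."""
    groups = {}
    for row in rows:
        if not predicate(row):          # restriction stage
            continue
        key = row[group_key]            # aggregation stage
        groups[key] = agg_update(groups.get(key, agg_init), row)
    return groups

rows = [
    {"region": "EU", "amount": 120},
    {"region": "US", "amount": 80},
    {"region": "EU", "amount": 45},
    {"region": "US", "amount": 300},
]
# SELECT region, SUM(amount) FROM rows WHERE amount > 100 GROUP BY region
print(stream_query(rows,
                   predicate=lambda r: r["amount"] > 100,
                   group_key="region",
                   agg_init=0,
                   agg_update=lambda acc, r: acc + r["amount"]))
```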
DOI: 10.1109/FCCM.2013.38 (published 2013-04-28)
Citations: 56