
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Papers

A Range and Scaling Study of an FPGA-Based Digital Wireless Channel Emulator
Scott Buscemi, William V. Kritikos, R. Sass
A Digital Wireless Channel Emulator (DWCE) is a system capable of emulating the RF environment for a group of wireless devices. A major issue with current designs is that they do not scale to enough nodes to emulate a meaningful network. One reason for this lack of scalability is the large amount of computation and network capacity such a system requires. Previously documented DWCE systems implement a hub-and-spoke configuration, which prevents them from scaling by simply adding hardware. This paper investigates the use of an FPGA cluster configured as a distributed system to provide the computational and network structure needed to scale a DWCE to 1250 wireless devices. This scale is approximately two orders of magnitude larger than any previously documented system. The paper presents multiple FPGA cluster configurations that use currently available hardware and describes the algorithms used to route the signals through the network and place the computational hardware on each FPGA. The low-level VHDL Signal Path Component (SPC) is synthesized and mapped under different parameters to interpolate its resource utilization. One example FPGA build with enough SPCs to fill 80% of the FPGA resources is successfully run through the Xilinx tool-chain to determine the maximum FPGA system clock speed. Finally, scaling results are presented that detail the maximum sample frequency of DWCE systems of various sizes, which could be used to examine a variety of wireless devices.
DOI: https://doi.org/10.1109/FCCM.2013.42 (published 2013-04-28)
Citations: 7
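A back-of-envelope count may help show why computation and network capacity grow so quickly with the number of devices. The sketch below assumes, purely for illustration, that every ordered (transmitter, receiver) pair of distinct devices needs its own emulated signal path; the abstract does not state how many paths the actual design instantiates, and the function name is hypothetical.

```python
def directed_signal_paths(num_devices: int) -> int:
    """Count ordered (transmitter, receiver) pairs among distinct devices.

    Assumption for illustration only: one emulated signal path per ordered pair.
    """
    return num_devices * (num_devices - 1)


# Compare a hypothetical 125-device system with the 1250-device target.
for n in (125, 1250):
    print(n, directed_signal_paths(n))
```

Under this assumption, a tenfold increase in devices gives roughly a hundredfold increase in paths, which is the kind of growth a single hub would have to absorb in a hub-and-spoke design.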
Escaping the Academic Sandbox: Realizing VPR Circuits on Xilinx Devices
Eddie Hung, F. Eslami, S. Wilton
This paper presents a new, open-source method for FPGA CAD researchers to realize their techniques on real Xilinx devices. Specifically, we extend the Verilog-To-Routing (VTR) suite, which includes the VPR place-and-route CAD tool on which many FPGA innovations have been based, to generate working Xilinx bitstreams via the Xilinx Design Language (XDL). Currently, we can faithfully translate VPR's heterogeneous packing and placement results into an exact Xilinx 'map' netlist, which is then routed by its 'par' tool. We showcase the utility of this new method with two compelling applications targeting a 40nm Virtex-6 device: a fair comparison of the area, delay, and CAD runtime of academia's state-of-the-art VTR flow with a commercial, closed-source equivalent, along with a CAD experiment evaluated using physical measurements of on-chip power consumption and die temperature, over time. This extended flow, VTR-to-Bitstream, is released to the community with the hope that it can enhance existing research projects as well as unlock new ones.
DOI: https://doi.org/10.1109/FCCM.2013.40 (published 2013-04-28)
Citations: 49
On-chip Context Save and Restore of Hardware Tasks on Partially Reconfigurable FPGAs
Aurelio Morales-Villanueva, A. Gordon-Ross
Partial reconfiguration (PR) of field-programmable gate arrays (FPGAs) enables hardware tasks to time multiplex PR regions (PRRs) by isolating reconfiguration to only the reconfigured PRR, which avoids halting the entire FPGA's execution. Time multiplexing PRRs requires support for unloading and loading tasks and for resuming a task's execution state. To resume a task, its execution state (context) must be saved when the task is unloaded and restored when the task resumes; these operations are called context save (CS) and context restore (CR), respectively. In this paper, we present a software-based, on-chip context save and restore (CSR) for PR-capable FPGAs. Compared to prior work, our CSR is autonomous (i.e., does not require any external host support), does not require custom on-chip hardware, is portable across any system design, and does not require tool flow modifications or special tools. Experimental results extensively evaluate the CSR execution time based on PRR size, enabling designers to trade off PRR granularity for CSR execution time based on application requirements.
DOI: https://doi.org/10.1109/FCCM.2013.13 (published 2013-04-28)
Citations: 17
Birth and adolescence of reconfigurable computing: a survey of the first 20 years of field-programmable custom computing machines
Kenneth L. Pocek, R. Tessier, A. DeHon
For 20 years, the International Symposium on Field-Programmable Custom Computing Machines (FCCM) has explored how FPGAs and FPGA-like architectures can bring unique capabilities to computational tasks. We survey the evolution of the field of reconfigurable computing as reflected in FCCM, providing a guide to the body-of-knowledge accumulated in architecture, compute models, tools, run-time reconfiguration, and applications.
DOI: https://doi.org/10.1109/FPGA.2013.6882273 (published 2013-04-28)
Citations: 27
Global Atmospheric Simulation on a Reconfigurable Platform
L. Gan, H. Fu, W. Luk, Chao Yang, Wei Xue, Guangwen Yang
Summary form only given. As the only method to study long-term climate trends and to predict potential climate risk, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To cover high resolution in climate simulation scenarios, developers have to face the challenges of billions of mesh points and extremely complex algorithms. Shallow Water Equations (SWEs) are a set of conservation laws that exhibit most of the essential characteristics of the atmosphere. The study of SWEs can serve as the starting point for understanding the dynamic behavior of the global atmosphere. We choose the cubed-sphere mesh as the computational mesh for its better load balance in pole regions compared with other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube to the surface of the sphere. The computational domain is then the six patches, each of which is covered with N × N mesh points to be calculated. When written in local coordinates, SWEs have an identical expression on the six patches, that is $\frac{\partial Q}{\partial t} + \frac{1}{\Lambda}\frac{\partial(\Lambda F^1)}{\partial x^1} + \frac{1}{\Lambda}\frac{\partial(\Lambda F^2)}{\partial x^2} + S = 0$ (1), where $x^1, x^2 \in [-\pi/4, \pi/4]$ are the local coordinates, $Q = (h, hu^1, hu^2)^T$ is the prognostic variable, $F^i = u^i Q$ ($i = 1, 2$) are the convective fluxes, and $S$ is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, SWE solvers are transferred to the computation of a 13-point upwind stencil that exhibits a diamond shape. To get the prognostic components ($h$, $hu^1$ and $hu^2$) of the central point, its neighboring 12 points need to be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, and 99 divisions. The high arithmetic density of the SWEs algorithm makes it difficult to implement one kernel on the resource-limited FPGA card. In this study, we first propose a hybrid algorithm that utilizes both CPUs and FPGAs to simulate the global shallow water equations. In each computational patch, most of the complicated communication happens in the two layers of the outer boundary, whose values need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that includes the two layers of outer boundary meshes, and an inner part that is the remaining part. We assign the CPU to handle the communication and the stencil calculation of the outer part, and assign the FPGA to process the inner-part stencil. In this way, the FPGA and CPU work simultaneously, and the CPU time for stencil and communication can be hidden in the FPGA time for stencil. For the Virtex-6 SX475T that we use in our study, the original program in double precision would require 299% of the on-board LUTs, 283% of the FFs and 189% of the DSPs, and cannot fit into one FPGA. In order to fit the SWE kernel into one FPGA chip, we apply two algorithmic optimizations to the original design. The first replaces certain computations with lookup tables to reduce the use of computational resources. The second identifies common factors in the algorithm to eliminate redundant computation. Together these two optimizations reduce resource usage by 20%. To reduce the resource cost further and fit the extremely complex stencil kernel into one FPGA chip, we optimize in the customizable space of number representation and precision. For variables with a small range, we use fixed-point numbers instead of double precision. For the remaining parts, which have a large dynamic range, we use mixed-precision floating-point numbers. Using mixed-precision floating-point and fixed-point arithmetic, we build a complex upwind stencil kernel on a single FPGA. The design contains an efficient pipeline that performs hundreds of floating-point and fixed-point arithmetic operations concurrently. Compared with our previous work [1], the solution based on one FPGA acceleration card provides a 100x speedup over a 6-core CPU and a 4x speedup over a Tianhe-1A supercomputer node consisting of 12 CPU cores and one Fermi GPU.
DOI: https://doi.org/10.1109/FCCM.2013.26 (published 2013-04-28)
Citations: 0
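The stencil shape and the inner/outer split described in the abstract can be illustrated with a small sketch. It assumes that the diamond is the set of offsets within Manhattan distance 2 of the centre, which is consistent with 13 points and with the two boundary layers but is not spelled out in the abstract. The function names and the patch size of 1024 are hypothetical.

```python
def diamond_stencil_offsets(radius: int = 2):
    """Offsets (dx, dy) with |dx| + |dy| <= radius: a diamond-shaped footprint."""
    return [
        (dx, dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if abs(dx) + abs(dy) <= radius
    ]


def split_patch(n: int, layers: int = 2):
    """Count the mesh points of an n x n patch that fall in the outer
    `layers` boundary layers (CPU side) and in the inner part (FPGA side)."""
    inner_side = max(n - 2 * layers, 0)
    inner = inner_side * inner_side
    outer = n * n - inner
    return outer, inner


offsets = diamond_stencil_offsets()
print(len(offsets))            # centre point plus its 12 neighbours

outer, inner = split_patch(1024)   # 1024 is a hypothetical patch size
print(outer, inner, outer / (outer + inner))
```

For a large patch the outer part is a small fraction of the points, which is in line with the abstract's point that the CPU's stencil and communication work can be hidden behind the FPGA's inner-part computation.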
A High Throughput No-Stall Golomb-Rice Hardware Decoder
R. Moussalli, W. Najjar, Xi Luo, Amna Khan
Integer compression techniques can generally be classified as bit-wise and byte-wise approaches. Though at the cost of a larger processing time, bit-wise techniques typically result in a better compression ratio. The Golomb-Rice (GR) method is a bit-wise lossless technique applied to the compression of images, audio files and lists of inverted indices. However, since GR is a serial algorithm, decompression is regarded as a very slow process; to the best of our knowledge, all existing software and hardware native (non-modified) GR decoding engines operate bit-serially on the encoded stream. In this paper, we present (1) the first no-stall hardware architecture, capable of decompressing streams of integers compressed using the GR method, at a rate of several bytes (multiple integers) per hardware cycle; (2) a novel GR decoder based on the latter architecture is further detailed, operating at a peak rate of one integer per cycle. A thorough design space exploration study on the resulting resource utilization and throughput of the aforementioned approaches is presented. Furthermore, a performance study is provided, comparing software approaches to implementations of the novel hardware decoders. While occupying 10% of a Xilinx V6LX240T FPGA, the no-stall architecture core achieves a sustained throughput of over 7 Gbps.
DOI: https://doi.org/10.1109/FCCM.2013.9 (published 2013-04-28)
Citations: 13
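For readers unfamiliar with the coding scheme, the following is a minimal software sketch of the bit-serial baseline that the abstract contrasts against; it is not the paper's no-stall architecture. It assumes one common Rice convention: the quotient `n >> k` is written in unary as ones terminated by a zero, followed by the `k`-bit remainder, most significant bit first. Other conventions (for example zeros terminated by a one) exist, and the function names are hypothetical.

```python
def rice_encode(values, k):
    """Encode non-negative integers with Rice parameter k into a list of bits."""
    bits = []
    for n in values:
        quotient, remainder = n >> k, n & ((1 << k) - 1)
        bits.extend([1] * quotient)   # unary quotient
        bits.append(0)                # unary terminator
        # k-bit remainder, most significant bit first
        bits.extend((remainder >> i) & 1 for i in reversed(range(k)))
    return bits


def rice_decode(bits, k):
    """Decode a well-formed Rice bit stream one bit at a time."""
    values = []
    stream = iter(bits)
    for bit in stream:                # first bit of the next codeword
        quotient = 0
        while bit == 1:               # count ones until the terminating zero
            quotient += 1
            bit = next(stream)
        remainder = 0
        for _ in range(k):            # read the fixed-width remainder
            remainder = (remainder << 1) | next(stream)
        values.append((quotient << k) | remainder)
    return values


sample = [0, 1, 5, 18, 7]
encoded = rice_encode(sample, k=2)
print(encoded)
print(rice_decode(encoded, k=2))
```

The loop makes the serial dependency visible: the position where a codeword starts is known only after the previous codeword has been fully decoded, which is why decoding several integers per cycle requires a different architecture.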
A Fast and Accurate FPGA-Based Fault Injection System
Thomas Schweizer, Dustin Peterson, Johannes Maximilian Kühn, T. Kuhn, W. Rosenstiel
This paper introduces an FPGA-based fault injection system. To realize this system, a library was developed that implements a static mapping between a circuit described at RTL or gate level and its corresponding placed and routed FPGA design. The aim of this mapping is to preserve module and port structure of the placed and routed FPGA design to the RT/gate-level circuit description. To demonstrate the accuracy of this mapping, the ISCAS'89 benchmark circuits and the VHDL netlist of the LEON3 system are used. The results show that about 99% of the ports in the RT/gate-level circuit description can be located in the placed and routed FPGA design. Based on this library, a fault injection tool was developed that shortens fault injection experiments by bypassing some stages (synthesis, placement and routing) of the re-compilation process. In these experiments, a 12× speedup was achieved compared to fault injection based on serial fault emulation.
DOI: https://doi.org/10.1109/FCCM.2013.47 (published 2013-04-28)
Citations: 8
Accurate Thermal-Profile Estimation and Validation for FPGA-Mapped Circuits
A. Amouri, H. Amrouch, T. Ebi, J. Henkel, M. Tahoori
Accurate thermal profile estimation for an FPGA at design time is necessary to avoid unexpected thermal hot-spots in the circuit before deploying the FPGA to in-field operation. Accurate dynamic and leakage power values are both needed for the thermal profile estimation, and they can be estimated using the FPGA vendor's tools. However, these tools report leakage power as a single value for the whole chip, and neither the literature nor the FPGA toolset gives details about its distribution across the FPGA chip for thermal simulation. To cope with this problem, we present a method for properly distributing the leakage power across the FPGA chip. The method uses a temperature-leakage loop estimation model to distribute and adapt the leakage power for more accurate thermal simulation. Furthermore, to accurately calibrate the presented method and its model, and to validate the resulting thermal profiles, we use an infrared thermal camera that measures the emissions from the backside of a Virtex-5 FPGA chip. The results of testing several designs, with different sizes and frequencies, show that our approach achieves accurate thermal-profile estimation when compared to the camera measurements, with an average absolute estimation error of around 1°C across the chip.
DOI: https://doi.org/10.1109/FCCM.2013.48 (published 2013-04-28)
Citations: 14
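The temperature-leakage loop mentioned in the thermal-profile abstract above can be illustrated with a toy fixed-point iteration. This is only a sketch under stated assumptions, not the authors' model: the exponential leakage-temperature dependence, the per-tile thermal resistance `r_th`, the coefficient `beta`, the uniform starting split, and the absence of lateral heat spreading between tiles are all simplifications introduced here for illustration.

```python
import math


def estimate_thermal_profile(dynamic_power, total_leakage_ref,
                             t_ambient=25.0, t_ref=25.0,
                             r_th=2.0, beta=0.02,
                             max_iters=100, tol=1e-6):
    """Iterate a toy temperature-leakage loop over a list of tiles.

    dynamic_power      per-tile dynamic power in watts
    total_leakage_ref  chip-wide leakage (watts) reported at t_ref
    r_th, beta         hypothetical constants, not taken from the paper
    """
    n = len(dynamic_power)
    base_leakage = total_leakage_ref / n   # uniform split as a starting guess
    leakage = [base_leakage] * n
    temps = [t_ambient] * n

    for _ in range(max_iters):
        # Thermal step: each tile heats up in proportion to its own power.
        new_temps = [t_ambient + r_th * (d + l)
                     for d, l in zip(dynamic_power, leakage)]
        # Leakage step: hotter tiles leak more (assumed exponential law).
        leakage = [base_leakage * math.exp(beta * (t - t_ref))
                   for t in new_temps]
        # Stop once the temperatures no longer change noticeably.
        delta = max(abs(a - b) for a, b in zip(new_temps, temps))
        temps = new_temps
        if delta < tol:
            break
    return temps, leakage


temps, leakage = estimate_thermal_profile([0.1, 0.5, 0.1, 0.1], 1.0)
```

In this sketch the tile with the highest dynamic power ends up both hotter and leakier than its neighbours, which is the feedback effect that a single chip-wide leakage figure cannot capture.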
On Optimizing the Arithmetic Precision of MCMC Algorithms
Grigorios Mingas, Farhan Rahman, C. Bouganis
Markov Chain Monte Carlo (MCMC) is a ubiquitous stochastic method used to draw random samples from arbitrary probability distributions, such as the ones encountered in Bayesian inference. MCMC often requires forbiddingly long runtimes to give a representative sample in problems with high dimensions and large-scale data. Field-Programmable Gate Arrays (FPGAs) have proven to be a suitable platform for MCMC acceleration due to their ability to support massive parallelism. This paper introduces an automated method that minimizes the floating-point precision of the most computationally intensive part of an FPGA-mapped MCMC sampler, while keeping the precision-related bias in the output within a user-specified tolerance. The method is based on an efficient bias estimator, proposed here, which can estimate the bias in the output with only a few random samples. The optimization process involves FPGA pre-runs, which estimate the bias and choose the optimized precision. This precision is then used to reconfigure the FPGA for the final, long MCMC run, allowing for higher sampling throughput. The process requires no user intervention. The method is tested on two Bayesian inference case studies: mixture models and neural network regression. The achieved speedups over double-precision FPGA designs were 3.5x-5x (including the optimization overhead). Comparisons with a sequential CPU and a GPGPU showed speedups of 223x-446x and 16x-18x, respectively.
DOI: https://doi.org/10.1109/FCCM.2013.31 · Published: 2013-04-28
Citations: 9
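The pre-run idea in the MCMC abstract above (estimate the precision-related bias, then pick the lowest precision that stays within tolerance) can be sketched in software. The code below is a hedged illustration, not the paper's bias estimator: it emulates reduced precision by rounding the mantissa, and it estimates bias by comparing a reduced-precision Metropolis chain with a full-precision chain driven by the same random numbers. The function names, the Gaussian toy likelihood, and the candidate bit widths are all assumptions made here.

```python
import math
import random


def round_mantissa(x, bits):
    """Round x to `bits` mantissa bits (a software stand-in for custom floats)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def log_likelihood(mu, data, bits):
    """Gaussian log-likelihood (unit variance), rounded after every operation."""
    acc = 0.0
    for d in data:
        term = round_mantissa((d - mu) ** 2, bits)
        acc = round_mantissa(acc + term, bits)
    return round_mantissa(-0.5 * acc, bits)


def run_chain(data, bits, n_samples, seed):
    """Random-walk Metropolis; returns the sample mean of mu."""
    rng = random.Random(seed)
    mu = 0.0
    ll = log_likelihood(mu, data, bits)
    total = 0.0
    for _ in range(n_samples):
        proposal = mu + rng.gauss(0.0, 0.5)
        ll_prop = log_likelihood(proposal, data, bits)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < ll_prop - ll:
            mu, ll = proposal, ll_prop
        total += mu
    return total / n_samples


def choose_precision(data, tolerance, candidates=(8, 12, 16, 24, 53),
                     n_samples=2000, seed=1):
    """Return the first candidate whose estimated bias is within tolerance."""
    reference = run_chain(data, 53, n_samples, seed)   # double precision
    for bits in candidates:
        bias = abs(run_chain(data, bits, n_samples, seed) - reference)
        if bias <= tolerance:
            return bits, bias
    return 53, 0.0


bits, bias = choose_precision([0.9, 1.1, 1.0, 1.2, 0.8], tolerance=0.05)
```

Driving both chains with the same seed is a common-random-numbers trick that keeps the comparison cheap; on an FPGA the equivalent step would be the short pre-runs the abstract describes, followed by reconfiguration at the chosen precision.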
High Speed Video Processing Using Fine-Grained Processing on FPGA Platform
Z. Ang, Akash Kumar, Yajun Ha
This summary paper proposes an FPGA-based array processor that performs Laplacian filtering on a 40 by 40 pixel grayscale video. The architecture comprises bit-serial pixel processors interconnected to form a two-dimensional mesh array. This architecture features the novel use of partial reconfiguration to transfer data to and from the array. Each processor occupies a configurable logic block and achieves a target frame rate of 10000 frames per second, at an operating frequency of 0.31 MHz on the Virtex-6 ML605 Evaluation Kit. The detailed correspondence between the contents of slice lookup tables and the Virtex-6 bitstream format is also documented.
DOI: https://doi.org/10.1109/FCCM.2013.32 · Published: 2013-04-28
Citations: 7
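As a software reference for the Laplacian filtering described in the video-processing abstract above, the sketch below applies a filter to a small grayscale frame and works out the clock budget implied by the stated figures. The 4-neighbour kernel and the zero-padded border are assumptions, since the abstract does not specify either; the cycle count is simply the stated 0.31 MHz clock divided by the stated 10000 frames per second.

```python
def laplacian_4(frame):
    """Apply a 4-neighbour Laplacian to a 2-D list of pixel values.

    Pixels outside the frame are treated as 0 (an assumed border policy).
    """
    height, width = len(frame), len(frame[0])

    def px(r, c):
        # Return the pixel value, or 0 when (r, c) falls outside the frame.
        return frame[r][c] if 0 <= r < height and 0 <= c < width else 0

    return [[px(r - 1, c) + px(r + 1, c) + px(r, c - 1) + px(r, c + 1)
             - 4 * frame[r][c]
             for c in range(width)]
            for r in range(height)]


def cycles_per_frame(clock_hz, frames_per_second):
    """Clock cycles available to process one frame."""
    return clock_hz / frames_per_second


# A 40 by 40 frame with one bright pixel in the middle.
frame = [[0] * 40 for _ in range(40)]
frame[20][20] = 100
filtered = laplacian_4(frame)

budget = cycles_per_frame(310_000, 10_000)
```

With one processor per pixel working in parallel, the whole frame has to be filtered within that per-frame cycle budget, which is why a bit-serial datapath at a very low clock rate can still reach the stated frame rate.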