
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Papers

Kung Fu Data Energy - Minimizing Communication Energy in FPGA Computations
E. Kadrić, K. Mahajan, A. DeHon
The energy in FPGA computations can be dominated by data communication energy, either in the form of memory references or data movement on interconnect (e.g., over 75% of energy for single processor Gaussian Mixture Modeling, Window Filtering, and FFT). In this paper, we explore how to use data placement and parallelism to reduce communication energy. We further introduce a new architecture for embedded memories, the Continuous Hierarchy Memory (CHM), and show that it increases the opportunities to reduce energy by strategic data placement. For three common FPGA tasks in signal and image processing (Gaussian Mixture Modeling, Window Filters, and FFTs), we show that data movement energy can vary over a factor of 9. The best solutions exploit parallelism and hierarchy and are 1.8-6.0× more energy-efficient than designs that place all data in a large memory bank. With the CHM, we can get an additional 10% improvement for full voltage logic and 30-80% when operating the computation at reduced voltage.
DOI: https://doi.org/10.1109/FCCM.2014.66 · Published: 2014-05-11
Citations: 11
A Fully Pipelined and Dynamically Composable Architecture of CGRA
J. Cong, Hui Huang, Chiyuan Ma, Bingjun Xiao, Peipei Zhou
Future processor chips will not be limited by the transistor resources, but will be mainly constrained by energy efficiency. Reconfigurable fabrics bring higher energy efficiency than CPUs via customized hardware that adapts to user applications. Among different reconfigurable fabrics, coarse-grained reconfigurable arrays (CGRAs) can be even more efficient than fine-grained FPGAs when bit-level customization is not necessary in target applications. CGRAs were originally developed in the era when transistor resources were more critical than energy efficiency. Previous work shares hardware among different operations via modulo scheduling and time multiplexing of processing elements. In this work, we focus on an emerging scenario where transistor resources are rich. We develop a novel CGRA architecture that enables full pipelining and dynamic composition to improve energy efficiency by taking full advantage of abundant transistors. Several new design challenges are solved. We implement a prototype of the proposed architecture in a commodity FPGA chip for verification. Experiments show that our architecture can fully exploit the energy benefits of customization for user applications in the scenario of rich transistor resources.
DOI: https://doi.org/10.1109/FCCM.2014.12 · Published: 2014-05-11
Citations: 73
Fast, Power-Efficient Biophotonic Simulations for Cancer Treatment Using FPGAs
Jeffrey Cassidy, L. Lilge, Vaughn Betz
Biophotonics, the study of light propagation through living tissue, is important for many medical applications ranging from imaging and detection through therapy for conditions such as cancer. Effective medical use of light depends on simulating its propagation through highly-scattering tissue. Monte Carlo simulation of photon migration has been adopted as the “gold standard” for its ability to capture complicated geometries and model all of the relevant problem physics. This accuracy and generality come at a high computational cost, which limits the technique's utility. Greatly generalizing previous work, we present the first and only hardware-accelerated Monte Carlo biophotonic simulator that can accept complicated geometries described by tetrahedral meshes. Implemented on an Altera Stratix V FPGA, it achieves high performance (4x) and extremely high energy efficiency (67x) compared to a tightly-optimized multi-threaded CPU implementation, with demonstrated potential to expand the performance gains even further to 15-20x, which would enable important clinical and research applications.
DOI: https://doi.org/10.1109/FCCM.2014.45 · Published: 2014-05-11
Citations: 13
Compiling Higher Order Functional Programs to Composable Digital Hardware
E. Aguilar-Pelaez, Samuel Bayliss, Alex I. Smith, F. Winterstein, D. Ghica, David B. Thomas, G. Constantinides
This work demonstrates the capabilities of a high-level synthesis tool-chain that allows the compilation of higher order functional programs to gate-level hardware descriptions. Higher order programming allows functions to take functions as parameters. In a hardware context, the latency-insensitive interfaces generated between compiled modules enable late-binding with libraries of pre-existing functions at the place-and-route compilation stage. We demonstrate the completeness and utility of our approach using a case study: a recursive k-means clustering algorithm. The algorithm features complex data-dependent control flow and opportunities to exploit both coarse and fine-grained parallelism.
DOI: https://doi.org/10.1109/FCCM.2014.69 · Published: 2014-05-11
Citations: 2
A New Algorithm for Carry-Free Addition of Binary Signed-Digit Numbers
K. Schneider, Adrian Willenbücher
Signed-digit (SD) numbers generalize traditional radix numbers by allowing negative digits within a certain range. Typically, this leads to redundant number representations that can be used to avoid the carry propagation problem of addition of radix numbers. Unfortunately, as proved by Avizienis, the standard algorithm for carry-free addition of SD numbers does not work for the binary case. In this paper, we therefore construct a special algorithm for the carry-free addition and subtraction of binary SD numbers, i.e., addition and subtraction of n-digit numbers are performed with circuits of depth O(1) and size O(n). This is possible by computing in addition to the transfer digits used by the standard algorithm one additional bit that allows us to distinguish relevant cases to avoid propagation of dependencies. The additional bit and the transfer digit used to compute the sum digit at position i depend only on the summands' digits at positions i and i - 1 so that all sum digits can be computed with a hardware circuit of a depth that is independent of the number of digits. We first explain the basics of the standard addition algorithm to derive the additional information needed to fix the algorithm for the binary case. After proving the correctness of our algorithm, we present experimental results that show that our implementation clearly outperforms two's complement addition even for small numbers, and saves 50% of the required chip area compared to other carry-free implementations.
DOI: https://doi.org/10.1109/FCCM.2014.24 · Published: 2014-05-11
Citations: 5
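To make the idea of carry-free addition in the signed-digit entry above more concrete, here is a minimal Python sketch of a classical textbook-style rule for adding binary signed-digit numbers (digits in {-1, 0, 1}). It is not the algorithm from the paper, whose details are not given in the abstract. In this sketch, each sum digit is determined by the operand digits in a fixed window (positions i, i-1 and i-2), so a hardware version would have a depth that does not grow with the number of digits. The function names `sd_value`, `sd_add` and `exhaustive_check` are illustrative assumptions.

```python
from itertools import product


def sd_value(digits):
    """Integer value of a binary signed-digit number (least significant digit first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Add two equal-length binary SD numbers; returns len(x) + 1 digits in {-1, 0, 1}.

    Classical limited-carry rule (an assumption for illustration, not the paper's
    method). The loop is sequential only because this is software: every iteration
    reads only x and y, so all positions could be evaluated in parallel.
    """
    n = len(x)
    assert len(y) == n
    w = [0] * n          # interim sum digits
    t = [0] * (n + 1)    # t[i] is the transfer digit INTO position i
    for i in range(n):
        p = x[i] + y[i]  # position sum, in -2..2
        # If both lower digits are non-negative, the incoming transfer t[i] is >= 0;
        # otherwise it is <= 0. This lets us pick w[i] so that w[i] + t[i] stays in range.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            t[i + 1], w[i] = 1, 0
        elif p == 1:
            t[i + 1], w[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == 0:
            t[i + 1], w[i] = 0, 0
        elif p == -1:
            t[i + 1], w[i] = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p == -2
            t[i + 1], w[i] = -1, 0
    return [w[i] + t[i] for i in range(n)] + [t[n]]


def exhaustive_check(n):
    """Check value and digit range for every pair of n-digit SD operands."""
    for x in product((-1, 0, 1), repeat=n):
        for y in product((-1, 0, 1), repeat=n):
            s = sd_add(list(x), list(y))
            if sd_value(s) != sd_value(x) + sd_value(y):
                return False
            if any(d not in (-1, 0, 1) for d in s):
                return False
    return True


print(sd_add([1, 1, 1, 1], [1, 1, 1, 1]))  # 15 + 15 -> [0, 1, 1, 1, 1], i.e. 30
```

Subtraction follows by negating every digit of the second operand before calling `sd_add`, since negation of a signed-digit number is digit-wise.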
Experiments in Mapping Expressions to DSP Blocks
Bajaj Ronak, Suhaib A. Fahmy
Mapping complex mathematical expressions to DSP blocks by relying on synthesis from pipelined code is inefficient and results in significantly reduced throughput. We have developed a tool to demonstrate the benefit of considering the structure and pipeline arrangement of the DSP block in mapping of functions. Implementations where the structure of the DSP block is considered during pipelining achieve double the throughput of other methods, demonstrating that the structure of the DSP block must be considered when scheduling complex expressions.
DOI: https://doi.org/10.1109/FCCM.2014.34 · Published: 2014-05-11
Citations: 2
FPGA Implementation of EM Algorithm for 3D CT Reconstruction
Young-kyu Choi, J. Cong, Di Wu
Although the expectation maximization (EM)-based 3D computed tomography (CT) reconstruction algorithm lowers radiation exposure, its long execution time hinders practical usage. To accelerate this process, we introduce a novel external memory bandwidth reduction strategy by reusing both the sinogram and the voxel intensity. Also, a customized computing engine based on field-programmable gate array (FPGA) is presented to increase the effective memory bandwidth. Experiments on actual patient data show that an 85x speedup can be achieved over a single-threaded CPU.
DOI: https://doi.org/10.1109/FCCM.2014.48 · Published: 2014-05-11
Citations: 9
Mapping Tasks to a Dynamically Reconfigurable Coarse Grained Array
M. S. Moghaddam, K. Paul, M. Balakrishnan
Coarse-Grained Reconfigurable Architectures (CGRAs) have become popular in recent times as the increased transistor densities have enabled greater integration of increasingly complex “compute cores”. These devices pack massive compute power and can be effectively used to build efficient solutions for applications which have a significant degree of parallelism. In many cases, these CGRAs are also partially reconfigurable. Clearly, to make effective use of these highly “parallel compute platforms”, a good mapping flow is required to map the parallelism that is present in a target application.
DOI: https://doi.org/10.1109/FCCM.2014.20 · Published: 2014-05-11
Citations: 0
GraphGen: An FPGA Framework for Vertex-Centric Graph Computation
E. Nurvitadhi, G. Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, J. Hoe, José F. Martínez, Carlos Guestrin
Vertex-centric graph computations are widely used in many machine learning and data mining applications that operate on graph data structures. This paper presents GraphGen, a vertex-centric framework that targets FPGAs for hardware acceleration of graph computations. GraphGen accepts a vertex-centric graph specification and automatically compiles it onto an application-specific synthesized graph processor and memory system for the target FPGA platform. We report design case studies using GraphGen to implement stereo matching and handwriting recognition graph applications on Terasic DE4 and Xilinx ML605 FPGA boards. Results show up to 14.6× and 2.9× speedups over software on an Intel Core i7 CPU for the two applications, respectively.
DOI: https://doi.org/10.1109/FCCM.2014.15 · Published: 2014-05-11
Citations: 115
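For readers unfamiliar with the vertex-centric model mentioned in the GraphGen entry above, the following Python sketch shows the general programming style: the user writes one small update function per vertex, and a runtime applies it to every vertex until nothing changes. This is a generic software illustration only. It does not reflect GraphGen's actual specification format or its generated hardware, and the names `vertex_centric_run` and `min_label` are hypothetical.

```python
def vertex_centric_run(neighbors, init, update, max_iters=100):
    """Synchronously apply `update` to every vertex until a fixed point is reached.

    neighbors: dict mapping each vertex to a list of adjacent vertices
    init:      dict mapping each vertex to its initial value
    update:    function(own_value, list_of_neighbor_values) -> new value
    """
    state = dict(init)
    for _ in range(max_iters):
        # Every vertex reads only the previous iteration's values, so all
        # updates within one iteration are independent of each other.
        new_state = {
            v: update(state[v], [state[u] for u in neighbors[v]])
            for v in neighbors
        }
        if new_state == state:
            break
        state = new_state
    return state


def min_label(own, neighbor_values):
    """Vertex program for connected components: adopt the smallest label seen."""
    return min([own] + neighbor_values)


# Two components: {0, 1, 2} and {3, 4}.
graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
labels = vertex_centric_run(graph, {v: v for v in graph}, min_label)
print(labels)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Because each per-vertex update depends only on values from the previous iteration, the inner step exposes the kind of parallelism that a hardware graph processor can exploit.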
Building Optimized Packet Filters with COFFi
Sven Hager, F. Winkler, B. Scheuermann, Klaus Reinhardt
Many companies and institutions employ packet filter firewalls in order to effectively regulate network traffic. Unfortunately, the constant growth of network bandwidth makes the task of matching packet headers against potentially large rulesets more difficult, and prohibits the sole use of entirely software-based firewalls which cannot cope with such huge amounts of traffic. Instead, high-speed firewalls are often implemented in ASICs which offer a high degree of parallelism, many opportunities for operation pipelining, and low-latency access to network data. However, due to their static nature, ASICs must provide generic filtering circuitry that is hardly able to take full advantage of firewall ruleset properties, thus leading to a waste of hardware resources.
DOI: https://doi.org/10.1109/FCCM.2014.38 · Published: 2014-05-11
Citations: 1
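The packet-filter entry above refers to matching packet headers against a ruleset. As a rough illustration of that task in software, the Python sketch below performs a linear first-match search over a small rule list using the standard `ipaddress` module. It is a baseline sketch, not COFFi's approach, which the abstract does not describe. The `Rule` fields, the `classify` function, and the sample rules (which use documentation-reserved address ranges) are all assumptions made for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class Rule:
    src: str               # source prefix in CIDR notation
    dst: str               # destination prefix in CIDR notation
    proto: Optional[str]   # "tcp", "udp", ... or None for any protocol
    dport: Optional[int]   # destination port or None for any port
    action: str            # "accept" or "drop"


def matches(rule, pkt):
    """True if every header field of the packet satisfies the rule."""
    return (
        ip_address(pkt["src"]) in ip_network(rule.src)
        and ip_address(pkt["dst"]) in ip_network(rule.dst)
        and rule.proto in (None, pkt["proto"])
        and rule.dport in (None, pkt["dport"])
    )


def classify(rules, pkt, default="drop"):
    """Return the action of the first matching rule, or the default action."""
    for rule in rules:
        if matches(rule, pkt):
            return rule.action
    return default


RULES = [
    Rule("192.0.2.0/24", "0.0.0.0/0", None, None, "drop"),
    Rule("0.0.0.0/0", "198.51.100.0/24", "tcp", 443, "accept"),
]

pkt = {"src": "203.0.113.5", "dst": "198.51.100.10", "proto": "tcp", "dport": 443}
print(classify(RULES, pkt))  # accept
```

The per-packet cost of this loop grows linearly with the number of rules, which is one reason purely software-based matching struggles as rulesets and link speeds grow.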