
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest publications

Accelerating parameter estimation for multivariate self-exciting point processes
Ce Guo, W. Luk
Self-exciting point processes are stochastic processes capturing occurrence patterns of random events. They offer powerful tools to describe and predict temporal distributions of random events like stock trading and neurone spiking. A critical calculation in self-exciting point process models is parameter estimation, which fits a model to a data set. This calculation is computationally demanding when the number of data points is large and when the data dimension is high. This paper proposes the first reconfigurable computing solution to accelerate this calculation. We derive an acceleration strategy in a mathematical specification by eliminating complex data dependency, by cutting hardware resource requirement, and by parallelising arithmetic operations. In our experimental evaluation, an FPGA-based implementation of the proposed solution is up to 79 times faster than one CPU core, and 13 times faster than the same CPU with eight cores.
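To make the computational bottleneck concrete, the sketch below evaluates the log-likelihood of a multivariate Hawkes (self-exciting) point process with exponential kernels, the kind of parameter-estimation objective the paper accelerates. The function and parameter names are illustrative rather than taken from the paper, and the nested O(n^2) pairwise sums over events are exactly the data-dependent work an FPGA implementation would restructure.

```python
import numpy as np

def hawkes_loglik(times, marks, mu, alpha, beta, T):
    """Naive O(n^2) log-likelihood of a multivariate Hawkes process with
    exponential kernels -- the calculation that parameter estimation evaluates
    repeatedly and that the paper accelerates.
    times : (n,) sorted event times        marks : (n,) dimension of each event
    mu    : (D,) baseline intensities      alpha : (D, D) excitation matrix
    beta  : kernel decay rate              T     : end of the observation window
    """
    times = np.asarray(times, dtype=float)
    marks = np.asarray(marks, dtype=int)
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    ll = 0.0
    for i in range(len(times)):
        dt = times[i] - times[:i]                        # delays from all earlier events
        excite = np.sum(alpha[marks[i], marks[:i]] * beta * np.exp(-beta * dt))
        ll += np.log(mu[marks[i]] + excite)              # log-intensity at each event
    ll -= np.sum(mu) * T                                 # compensator: baseline part
    for j in range(len(times)):                          # compensator: excitation part
        ll -= np.sum(alpha[:, marks[j]]) * (1.0 - np.exp(-beta * (T - times[j])))
    return ll
```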
Citations: 9
Hardware acceleration of database operations
J. Casper, K. Olukotun
As the amount of memory in database systems grows, entire database tables, or even databases, are able to fit in the system's memory, making in-memory database operations more prevalent. This shift from disk-based to in-memory database systems has contributed to a move from row-wise to columnar data storage. Furthermore, common database workloads have grown beyond online transaction processing (OLTP) to include online analytical processing and data mining. These workloads analyze huge datasets that are often irregular and not indexed, making traditional database operations like joins much more expensive. In this paper we explore using dedicated hardware to accelerate in-memory database operations. We present hardware to accelerate the selection process of compacting a single column into a linear column of selected data, joining two sorted columns via merging, and sorting a column. Finally, we put these primitives together to accelerate an entire join operation. We implement a prototype of this system using FPGAs and show substantial improvements in both absolute throughput and utilization of memory bandwidth. Using the prototype as a guide, we explore how the hardware resources required by our design change with the desired throughput.
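As a reference for the primitives described above, here is a minimal software sketch of select-and-compact and of a merge join over two columns already sorted on the join key, followed by the sort-then-merge pipeline; the helper names are illustrative and are not the paper's hardware interfaces.

```python
def select_compact(column, predicate):
    """Compact a column into a dense array of the values passing the predicate."""
    return [v for v in column if predicate(v)]

def merge_join(left, right):
    """Join two key columns sorted ascending by merging them in a single pass
    (handles duplicate keys on both sides)."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:                                   # equal keys: emit all matching pairs
            k = j
            while k < len(right) and right[k] == left[i]:
                out.append((left[i], right[k]))
                k += 1
            i += 1
    return out

def join(left_unsorted, right_unsorted):
    """Full join: sort both key columns, then merge them."""
    return merge_join(sorted(left_unsorted), sorted(right_unsorted))
```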
Citations: 173
Revisiting and-inverter cones
Grace Zgheib, Liqun Yang, Zhihong Huang, D. Novo, H. Parandeh-Afshar, Haigang Yang, P. Ienne
And-Invert Cones (AICs) have been suggested as an alternative to the ubiquitous Look-Up Tables (LUTs) used in commercial FPGAs. The original article suggesting the new architecture made some untested assumptions on the circuitry needed to implement AIC architectures and did not develop completely the toolset necessary to assess comprehensively the idea. In this paper, we pick up the architecture that some of us proposed in the original AIC paper and try to implement it as thoroughly as we can afford. We build all components for the logic cluster at transistor level in a 40 nm technology as well as a LUT-based architecture inspired by Altera's Stratix IV. We first determine that the characteristics of our LUT-based architecture are reasonably similar to those of the commercial counterpart. Then, we compare the AIC architecture to the baseline on a number of benchmarks, and we find a few difficulties that had been overlooked before. We thus explore other design possibilities around the original design point and show their detailed impact. Finally, we discuss how the very structure of current logic clusters seems not perfectly appropriate for getting the best out of AICs and conclude that, even though they are not confirmed as an immediate blessing today, AICs still offer rich research opportunities.
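For readers who have not seen the original proposal, the sketch below is a simple functional model of an and-inverter cone as a binary tree of 2-input AND gates with programmable output inversion, contrasted with a 4-input LUT that just indexes a truth table. This is only an illustrative model of the logic family being evaluated, not the transistor-level cell designed in the paper.

```python
def aic_eval(inputs, invert):
    """Functional model of an and-inverter cone: a complete binary tree of
    2-input AND gates with a programmable inversion bit on every gate output.
    inputs : leaf bits (length must be a power of two)
    invert : one bit per gate, level by level from the leaves upward
             (len(inputs) - 1 bits in total)."""
    level, k = list(inputs), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            g = level[i] & level[i + 1]        # 2-input AND gate
            nxt.append(g ^ invert[k])          # optional inversion of the gate output
            k += 1
        level = nxt
    return level[0]

def lut4_eval(inputs, truth_table):
    """A 4-input LUT, for contrast: it simply indexes a 16-bit truth table."""
    idx = inputs[0] | (inputs[1] << 1) | (inputs[2] << 2) | (inputs[3] << 3)
    return (truth_table >> idx) & 1

# Example: a 4-input cone computing NOT(a AND b) AND (c AND d)
print(aic_eval([1, 1, 1, 1], invert=[1, 0, 0]))   # -> 0
```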
Citations: 20
Methodology to generate multi-dimensional systolic arrays for FPGAs using openCL (abstract only)
Nick Ni
Systolic arrays (SAs) in an FPGA provide a significant speed-up on many scientific calculations through massive parallelism exploitation. The low-level hardware design of such complex SAs is becoming more time-consuming and non-scalable as more transistors become available on a single chip. In this paper we present a novel methodology to generate multi-dimensional SAs for FPGAs using a well-accepted high-level language, OpenCL. Kernels written in OpenCL can then be compiled directly into hardware using an OpenCL high-level synthesis tool. A complex case study using our methodology is presented. We were able to design, generate, verify and optimize an entire FPGA-based hardware accelerator for the Smith-Waterman algorithm in only three man-weeks. The accelerator's top performance was 32.6 GCUPS (Giga-Cell-Updates-Per-Second) on a DNA similarity search, with 1.3 GCUPS/watt efficiency. The result is superior to most state-of-the-art CPU/GPU implementations and competitive with a hand-crafted hardware design that took many months to develop.
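The case study maps the Smith-Waterman recurrence onto a systolic array, so a plain software reference of that recurrence helps show why it fits: each processing element can own one query character (one row of the score matrix), and all cells on an anti-diagonal update independently each cycle. The linear gap penalty and scoring values below are assumptions for illustration.

```python
def smith_waterman(query, ref, match=2, mismatch=-1, gap=-1):
    """Reference Smith-Waterman recurrence.  In the systolic-array mapping,
    each processing element owns one query character and the anti-diagonals
    of this double loop are computed in parallel, one per clock cycle."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # diagonal: match/mismatch
                          H[i - 1][j] + gap,     # up: gap in the reference
                          H[i][j - 1] + gap)     # left: gap in the query
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))
```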
Citations: 0
Session details: Physical design
Jonathan Rose
{"title":"Session details: Physical design","authors":"Jonathan Rose","doi":"10.1145/3260936","DOIUrl":"https://doi.org/10.1145/3260936","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131287877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
1K manycore FPGA shared memory architecture for SOC (abstract only)
Y. Ben-Asher, Jacob Gendel, Gadi Haber, Oren Segal, Yousef Shajrawi
Manycore shared memory architectures hold significant promise to speed up and simplify SOCs. Using many homogeneous small cores would allow replacing the hardware accelerators of SOCs with parallel algorithms communicating through shared memory. Currently, shared memory is realized by maintaining cache consistency across the cores, caching all the connected cores against one main memory module. This approach, though used today, is not likely to be scalable enough to support the high number of cores needed for highly parallel SOCs. We therefore consider a theoretical scheme for shared memory wherein the shared address space is divided between a set of memory modules, and a communication network allows each core to access every such module in parallel. Load balancing between the memory modules is obtained by rehashing the memory address space. We have designed a simple generic shared memory architecture, synthesized it to 2, 4, 8, ..., 1024 cores for a Virtex-7 FPGA, and evaluated it on several parallel programs. The synthesis results and the execution measurements show that, for the FPGA, all problematic aspects of this construction can be resolved. For example, unlike ASICs, the growing complexity of the communication network is absorbed by the FPGA's routing grid and routing mechanism. This makes this type of architecture particularly suitable for FPGAs. We used 32-bit modified PACOBLAZE cores and tested different parameters of this architecture, verifying its ability to achieve high speedups. The results suggest that re-hashing is not essential and one hash function suffices (compared to the family of universal hash functions needed by the theoretical construction).
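The load-balancing idea, hashing the shared address space across the memory modules so that concurrent accesses from many cores tend to land on different banks, can be sketched as follows. The multiplicative hash constant and module count are illustrative choices, not the hash used in the paper.

```python
def module_of(addr, num_modules, a=0x9E3779B1, word_bits=32):
    """Map a shared-memory address to one of num_modules memory banks with a
    simple multiplicative hash, so consecutive addresses spread across modules
    instead of all hitting the same bank."""
    h = (addr * a) & ((1 << word_bits) - 1)
    return h % num_modules

# A burst of 64 consecutive word addresses is spread across 16 modules:
hist = [0] * 16
for addr in range(1000, 1064):
    hist[module_of(addr, 16)] += 1
print(hist)
```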
Citations: 0
Combining computation and communication optimizations in system synthesis for streaming applications
J. Cong, Muhuan Huang, Peng Zhang
Data streaming is a widely-used technique to exploit task-level parallelism in many application domains such as video processing, signal processing and wireless communication. In this paper we propose an efficient system-level synthesis flow to map streaming applications onto FPGAs with consideration of simultaneous computation and communication optimizations. The throughput of a streaming system is significantly impacted by not only the performance and number of replicas of the computation kernels, but also the buffer size allocated for the communications between kernels. In general, module selection/replication and buffer size optimization were addressed separately in previous work. Our approach combines these optimizations together in system scheduling which minimizes the area cost for both logic and memory under the required throughput constraint. We first propose an integer linear program (ILP) based solution to the combined problem which has the optimal quality of results. Then we propose an iterative algorithm which can achieve the near-optimal quality of results but has a significant improvement on the algorithm scalability for large and complex designs. The key contribution is that we have a polynomial-time algorithm for an exact schedulability checking problem and a polynomial-time algorithm to improve the system performance with better module implementation and buffer size optimization. Experimental results show that compared to the separate scheme of module select/replication and buffer size optimization, the combined optimization scheme can gain 62% area saving on average under the same performance requirements. Moreover, our heuristic can achieve 2 to 3 orders of magnitude of speed-up in runtime, with less than 10% area overhead compared to the optimal solution by ILP.
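To make the combined optimization concrete, here is a toy brute-force version of the problem: choose a replica count per kernel and a buffer size per channel so that the pipeline meets a throughput target at minimum total area. The cost model (kernel throughput = replicas/II, buffer depth scaling with the larger replica count on an edge) is invented for illustration and is far cruder than the paper's ILP formulation and iterative scheduler.

```python
from itertools import product

def combined_opt(kernels, edges, required_throughput, max_rep=4):
    """Toy joint optimization of module replication and buffer sizing.
    kernels : name -> (II, logic_area); a kernel's throughput is replicas / II.
    edges   : (src, dst) -> (base_depth, width); buffer area = depth * width,
              with depth = base_depth * max(replicas_src, replicas_dst).
    Returns (total_area, replicas_per_kernel) for the cheapest feasible choice."""
    names = list(kernels)
    best = None
    for reps in product(range(1, max_rep + 1), repeat=len(names)):
        r = dict(zip(names, reps))
        throughput = min(r[n] / kernels[n][0] for n in names)
        if throughput < required_throughput:
            continue                                 # misses the throughput target
        area = sum(r[n] * kernels[n][1] for n in names)          # logic area
        area += sum(base * max(r[s], r[d]) * width               # buffer (memory) area
                    for (s, d), (base, width) in edges.items())
        if best is None or area < best[0]:
            best = (area, r)
    return best

# Example: a 3-stage pipeline that must sustain 0.5 results per cycle.
kernels = {"load": (1, 100), "compute": (4, 300), "store": (1, 80)}
edges = {("load", "compute"): (16, 32), ("compute", "store"): (16, 32)}
print(combined_opt(kernels, edges, required_throughput=0.5))
```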
Citations: 34
Using DSP blocks to compute CRC hash in FPGA (abstract only)
V. Pus, Lukás Kekely, Tomás Závodník
Hash table and its variations are common ways to implement lookup operations in FPGA. The process of adding to, deleting from, and searching in the hash table uses one or more hash functions to compute the address to the table. A suitable hash function must meet statistical properties such as uniform distribution, use of all input bits, large change of output based on small change of input. Other desirable parameters are high throughput and low FPGA resources usage. We propose a novel approach to the CRC hash computation in FPGA. The method is suitable for applications such as hash tables, which use parallel inputs of fixed size and require high throughput. We employ DSP blocks present in modern FPGAs to perform all the necessary XOR operations, therefore our solution does not use any LUTs. We propose a Monte Carlo based heuristic to reduce the number of DSP blocks required. Our experimental results show that one DSP block capable of 48 XOR operations can replace around eleven 6-input LUTs. Our results further show that our solution performs less XOR operations than the solution with LUTs optimized by the synthesizer.
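The reason DSP blocks can absorb the whole computation is that a parallel CRC over a fixed-width input reduces to a set of parity (wide-XOR) equations: each output bit is the XOR of a fixed subset of input bits. The sketch below derives that tap matrix by simulating the bit-serial LFSR once per input bit, which is valid for a zero initial register because CRC is linear over GF(2); the polynomial and widths in the example are illustrative.

```python
def crc_tap_matrix(poly, crc_bits, data_bits):
    """Derive the XOR network of a parallel CRC: clock the bit-serial LFSR over
    a one-hot message for every input-bit position and record which CRC register
    bits end up set.  Each output bit of the parallel CRC is then the XOR (parity)
    of its tapped input bits -- the wide-XOR form a DSP block can evaluate."""
    mask = (1 << crc_bits) - 1
    taps = [[] for _ in range(crc_bits)]
    for bit in range(data_bits):
        state = 0
        for i in range(data_bits):                   # MSB-first bit-serial CRC update
            din = 1 if i == bit else 0
            fb = ((state >> (crc_bits - 1)) & 1) ^ din
            state = (state << 1) & mask
            if fb:
                state ^= poly
        for out in range(crc_bits):
            if (state >> out) & 1:
                taps[out].append(bit)
    return taps

def parallel_crc(data_bits_list, taps):
    """Evaluate the CRC in one step as per-output-bit parity over the tapped inputs."""
    crc = 0
    for out, tap in enumerate(taps):
        parity = 0
        for b in tap:
            parity ^= data_bits_list[b]
        crc |= parity << out
    return crc

# Example: CRC-8 (polynomial x^8 + x^2 + x + 1, i.e. 0x07) over a 32-bit input.
taps = crc_tap_matrix(poly=0x07, crc_bits=8, data_bits=32)
```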
Citations: 0
Transformations for throughput optimization in high-level synthesis (abstract only)
Peng Li, L. Pouchet, Deming Chen, J. Cong
Programming productivity of FPGA devices remains a significant challenge, despite the emergence of robust high level synthesis tools to automatically transform codes written in high-level languages into RTL implementations. Focusing on a class of programs with regular loop bounds and array accesses (so-called affine programs), the polyhedral compilation framework provides a convenient environment to automate many of the manual program transformation tasks that are still needed to improve the QoR of the HLS tool. In this work, we demonstrate that tiling-driven affine loop transformations, while mandatory to ensure good data reuse and reduce off-chip communication volumes, are not always enough to achieve the best throughput, determined by the Initiation Interval (II) for loop pipelining. We develop additional techniques to optimize the computation part to be executed on the FPGA, using Index-Set Splitting (ISS) to split loops into sub-loops with different properties (sequential/parallel, different memory port conflicts features). This is motivated by the presence of non-uniform data dependences in some affine benchmarks, which are not effectively handled by the affine transformation system for tiling implemented in the PolyOpt/HLS software. We develop a customized affine+ISS optimization algorithm that aims at reducing the II of pipelined inner loops to reduce the program latency. We report experimental results on numerous affine computations.
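A minimal illustration of index-set splitting may help: a loop whose dependence behaviour differs across its iteration space is split into sub-loops, each with a uniform property, so a pipelining tool can give the parallel part II=1 and pay the recurrence-limited II only where it is unavoidable. The loop body and split point below are invented for illustration and are not one of the paper's benchmarks.

```python
def original(a, n):
    """A single loop mixing two dependence behaviours: the first half has no
    loop-carried dependence, the second half carries a tight recurrence that
    forces a large initiation interval on the whole loop."""
    for i in range(1, n):
        if i < n // 2:
            a[i] = a[i] + 1                    # independent iterations
        else:
            a[i] = a[i - 1] * 0.5 + a[i]       # recurrence on the previous iteration
    return a

def index_set_split(a, n):
    """Same computation after index-set splitting: each sub-loop now has one
    uniform property, so the first can be pipelined at II=1 and only the
    second pays the recurrence-limited II."""
    for i in range(1, n // 2):                 # parallel / fully pipelinable sub-loop
        a[i] = a[i] + 1
    for i in range(max(1, n // 2), n):         # sequential sub-loop with the recurrence
        a[i] = a[i - 1] * 0.5 + a[i]
    return a
```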
Citations: 2
Producing high-quality real-time HDR video system with FPGA (abstract only)
T. Ai, Mir Adnan Ali, J. Steffan, Kalin Ovtcharov, Sarmad Zulfiqar, Steve Mann
Video cameras can only take photographs with limited dynamic range. One method to overcome this is to combine differently exposed images of the same subject matter (i.e., a Wyckoff Set), producing a High Dynamic Range (HDR) result. HDR digital photography started almost 20 years ago. It is now possible to produce HDR video in real time, both on high-power CPU/GPU systems and on low-power FPGA boards. However, other FPGA implementations have relied upon methods that are less accurate than current CPU- and GPU-based methods; namely, the earlier FPGA approaches used a weighted sum for image compositing. In this paper we provide a novel method for real-time HDR compositing. As an essential part of an upgraded HDR video production system, the resulting system combines differently exposed video streams (of the same subject matter) in Full HD (1080p at 60 fps) on a Kintex-7 FPGA. The proposed workflow, implemented in software written in C, estimates the camera response function according to its quadtree representation and generates the compositing circuit in Verilog HDL from a Wyckoff Set. This circuit consists of parts that perform addressing using multiplexer networks and estimation with bilinear interpolation. It is parameterizable by user-specified error constraints, allowing us to explore the trade-offs in resource usage and precision of the implementation.
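As background for the compositing step itself, the sketch below merges a set of differently exposed frames into a radiance map by inverting an assumed camera response and blending with per-pixel confidence weights. The simple gamma response and hat-shaped weighting are generic stand-ins for the paper's quadtree-estimated response function and error-constrained circuit.

```python
import numpy as np

def merge_hdr(images, exposures, gamma=2.2):
    """Blend a Wyckoff set (same scene, different exposure times) into one
    linear HDR radiance map.  Assumes a plain gamma camera response as a
    stand-in for an estimated response function."""
    images = [np.asarray(im, dtype=np.float64) / 255.0 for im in images]
    num = np.zeros_like(images[0])
    den = np.zeros_like(images[0])
    for im, t in zip(images, exposures):
        radiance = (im ** gamma) / t             # invert response, normalise by exposure
        weight = 1.0 - np.abs(2.0 * im - 1.0)    # trust mid-tones, distrust clipped pixels
        num += weight * radiance
        den += weight
    return num / np.maximum(den, 1e-6)
```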
Citations: 2