Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest papers, page 5

Modular multi-ported SRAM-based memories

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554773

Ameer Abdelhadi, G. Lemieux

Multi-ported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs and other processing systems often rely upon multi-ported memories for parallel access, and hence for higher performance. Although memories with a large number of read and write ports are important, their high implementation cost means they are used sparingly in designs. As a result, FPGA vendors only provide dual-ported block RAMs to handle the majority of usage patterns. In this paper, a novel and modular approach is proposed to construct multi-ported memories out of basic dual-ported RAM blocks. Like other multi-ported RAM designs, each write port uses a different RAM bank and each read port uses bank replication. The main contribution of this work is an optimization that merges the previous live-value-table (LVT) and XOR approaches into a common design that uses a generalized, simpler structure we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made; the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach uses wider memories to accommodate the XOR-ed data and suffers from lower clock speeds. Two specific I-LVT implementations are proposed and evaluated: binary and one-hot coding. The I-LVT approach is especially suitable for larger multi-ported RAMs because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer block RAMs than earlier approaches: for several configurations, the suggested method reduces block RAM usage by over 44% and improves clock speed by over 76%. To assist others, we are releasing our fully parameterized Verilog implementation as an open source hardware library. The library has been extensively tested using ModelSim and Altera's Quartus tools.
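The two earlier schemes that the abstract contrasts, LVT and XOR, can be sketched as small behavioral models. This is a minimal software illustration under stated assumptions, not the authors' Verilog library and not the I-LVT itself (the abstract does not describe its update rule). The class names `LvtRam` and `XorRam` are hypothetical, writes are applied one at a time rather than in the same clock cycle, and per-read-port bank replication is omitted because a software model does not need it.

```python
from functools import reduce
from operator import xor


class LvtRam:
    """LVT scheme: one bank per write port plus a table of 'last writer' IDs."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]
        self.lvt = [0] * depth  # live-value table: bank holding the newest value

    def write(self, port: int, addr: int, value: int) -> None:
        self.banks[port][addr] = value
        self.lvt[addr] = port  # every write port must be able to update the table

    def read(self, addr: int) -> int:
        return self.banks[self.lvt[addr]][addr]


class XorRam:
    """XOR scheme: no table; the value is the XOR of all banks at an address."""

    def __init__(self, depth: int, n_write: int):
        self.banks = [[0] * depth for _ in range(n_write)]

    def write(self, port: int, addr: int, value: int) -> None:
        # Store the value XOR-ed with the other banks' contents at this address.
        others = reduce(xor, (b[addr] for i, b in enumerate(self.banks) if i != port), 0)
        self.banks[port][addr] = value ^ others

    def read(self, addr: int) -> int:
        return reduce(xor, (b[addr] for b in self.banks), 0)
```

In the LVT model every write port updates the shared table, which is why a hardware LVT needs multiple write ports. In the XOR model each read and write must consult every bank, which hints at the extra memory traffic that scheme carries.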

Citations: 26

Non-adaptive sparse recovery and fault evasion using disjunct design configurations (abstract only)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554758

Ahmad Alzahrani, R. Demara

A run-time fault diagnosis and evasion scheme for reconfigurable devices is developed based on explicit Non-adaptive Group Testing (NGT). NGT involves grouping disjunct subsets of reconfigurable resources into test pools, or samples. Each test pool realizes a Diagnostic Configuration (DC) that performs functional testing during the diagnosis procedure. The collective test outcomes after testing each diagnostic pool can be efficiently decoded to identify up to d defective logic resources. An algorithm for constructing the NGT sampling procedure and resource placement at design time with a minimal number of test groups is derived through the d-disjunctness property, which is well known in the statistical literature. The combinatorial properties of the resultant DCs also guarantee that any possible set of d or fewer defective resources is left unused by at least one DC, allowing low-overhead fault resolution. The scheme also provides the ability to assess the failure state of the resources. The proposed testing scheme thus avoids the time-intensive run-time diagnosis imposed by previously proposed adaptive group testing for reconfigurable hardware, without compromising diagnostic coverage. In addition, the proposed NGT scheme can be combined with other fault tolerance approaches to improve their fault recovery strategies. Experimental results for a set of MCNC benchmarks using Xilinx ISE Design Suite on a Virtex-5 FPGA have demonstrated d-diagnosability at slice level with average accuracy of 99.15% and 97.76% for d=1 and d=2, respectively.
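The decoding step of non-adaptive group testing can be illustrated with a small sketch for d=1. It assumes a constant-weight pool assignment (every resource is used by a distinct set of exactly `weight` pools), which is one standard way to obtain a 1-disjunct design; it is not the construction algorithm from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def build_pools(n_tests: int, weight: int):
    """Give each resource a distinct set of `weight` test pools that use it."""
    return [frozenset(c) for c in combinations(range(n_tests), weight)]


def run_tests(pools, defective, n_tests: int):
    """A test pool fails if it uses at least one defective resource."""
    return [any(t in pools[r] for r in defective) for t in range(n_tests)]


def decode(pools, outcomes):
    """Flag a resource only if every pool that uses it failed."""
    return {r for r, used_by in enumerate(pools) if all(outcomes[t] for t in used_by)}


def fault_free_pools(outcomes):
    """Pools that passed use no defective resource and remain usable."""
    return [t for t, failed in enumerate(outcomes) if not failed]


pools = build_pools(n_tests=4, weight=2)          # 6 resources, 4 test pools
outcomes = run_tests(pools, defective={2}, n_tests=4)
suspects = decode(pools, outcomes)                # identifies resource 2
usable = fault_free_pools(outcomes)               # pools that avoid the fault
```

All tests are fixed before any outcome is known, which is what makes the scheme non-adaptive. The passing pools in `usable` correspond to the abstract's guarantee that at least one configuration avoids the defective resources.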

Citations: 5

Novel FPGA clock network with low latency and skew (abstract only)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554722

Lei Li, Jian Wang, Jinmei Lai

A clock network is a dedicated network for distributing multiple clock signals to every logic module in a system. Unlike an ASIC, where the clock tree is custom built by users, the clock network in an FPGA is usually fixed after chip fabrication and cannot be changed for different user circuits. This paper designs and implements an FPGA clock network with low latency and skew. We first propose a novel clock network for FPGAs, which has a backbone-and-branches topology and can be easily integrated into a tiled FPGA with reasonable area. The network has one clock backbone and several primary clock branches. When the chip scales up, this clock network can be extended easily. A series of strategies, such as hybrid multiplexers, bypassing, looping back and a Programmable Delay Adjustment Unit (DAU), are then employed to optimize latency and skew. Moreover, the prominent coupling capacitance and crosstalk effects of clock routing at nanometer nodes are also considered in the physical implementation. This clock network is applied to an FPGA of our own design in 65nm technology. Post-layout simulation results indicate that our clock network with normal loads can sustain a 600MHz clock with maximum clock latency and skew of typically 2.22ns and 40ps respectively (1.79ns and 39ps in the fast case), achieving up to 78.2% improvement in skew and 47.5% in latency compared to a commercial 65nm FPGA device.

Citations: 0

Towards interconnect-adaptive packing for FPGAs

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554783

J. Luu, Jonathan Rose, J. Anderson

In order to investigate new FPGA logic blocks, FPGA architects have traditionally needed to customize CAD tools to make use of the new features and characteristics of those blocks. The software development effort necessary to create such CAD tools can be a time-consuming process that can significantly limit the number and variety of architectures explored. Thus, architects want flexible CAD tools that can, with few or no software modifications, explore a diverse space. Existing flexible CAD tools suffer from impractically long runtimes and/or fail to efficiently make use of the important new features of the logic blocks being investigated. This work is a step towards addressing these concerns by enhancing the packing stage of the open-source VTR CAD flow [17] to efficiently deal with common interconnect structures that are used to create many kinds of useful novel blocks. These structures include crossbars, carry chains, dedicated signals, and others. To accomplish this, we employ three techniques in this work: speculative packing, pre-packing, and interconnect-aware pin counting. We show that these techniques, along with three minor modifications, result in improvements to runtime and quality of results across a spectrum of architectures, while simultaneously expanding the scope of architectures that can be explored. Compared with VTR 1.0 [17], we show an average 12-fold speedup in packing for fracturable LUT architectures with 20% lower minimum channel width and 6% lower critical path delay. We obtain a 6 to 7-fold speedup for architectures with non-fracturable LUTs and architectures with depopulated crossbars. In addition, we demonstrate packing support for logic blocks with carry chains.

Citations: 21

Optimizing effective interconnect capacitance for FPGA power reduction

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554788

Safeen Huda, J. Anderson, H. Tamura

We propose a technique to reduce the effective parasitic capacitance of interconnect routing conductors in a bid to simultaneously reduce power consumption and improve delay. The parasitic capacitance reduction is achieved by ensuring that routing conductors adjacent to those used by timing-critical or high-activity nets are left floating, that is, disconnected from both VDD and GND. In doing so, the effective coupling capacitance between the conductors is reduced, because the original coupling capacitance between the conductors is placed in series with other capacitances in the circuit (series combinations of capacitors correspond to lower effective capacitance). Allowing unused conductors to float requires the use of tri-state routing buffers, and to that end, we also propose low-cost tri-state buffer circuitry. We also introduce CAD techniques to maximize the likelihood that unused routing conductors are made to be adjacent to those used by nets with high activity or low slack, improving both power and speed. Results show that interconnect dynamic power reductions of up to ~15.5% are expected to be achieved with a critical path degradation of ~1%, and a total area overhead of ~2.1%.
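The series-capacitance argument can be checked with a short calculation. This is a simplified two-capacitor model with made-up values (10 fF of coupling capacitance and 30 fF from the neighbouring conductor to ground); neither the model nor the numbers come from the paper.

```python
def series_capacitance(c1: float, c2: float) -> float:
    """Effective capacitance of two capacitors in series: C1*C2 / (C1 + C2)."""
    return c1 * c2 / (c1 + c2)


# Hypothetical values in femtofarads, chosen only for illustration.
c_coupling = 10.0   # between the active wire and its unused neighbour
c_neighbour = 30.0  # from the unused neighbour to ground

# Neighbour tied to VDD or GND: the active wire sees the full coupling capacitance.
c_driven = c_coupling

# Neighbour floating: the coupling capacitance is in series with c_neighbour.
c_floating = series_capacitance(c_coupling, c_neighbour)
```

The series combination is always smaller than either capacitor alone, so in this model the floating neighbour always presents less load to the active wire than a driven one.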

Citations: 10

Accelerating massive short reads mapping for next generation sequencing (abstract only)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554707

Chunming Zhang, Wen Tang, Guangming Tan

Due to the explosion of gene sequencing data, with over one billion reads per run, the data-intensive computations of Next Generation Sequencing (NGS) applications pose great challenges to current computing capability. In this paper we investigate both algorithmic and architectural acceleration strategies for a typical NGS analysis algorithm, short reads mapping, on a commodity multicore and a customizable FPGA coprocessor architecture, respectively. First, we propose a hash bucket reordering algorithm that increases shared-cache parallelism while searching the hash index. The algorithmic strategy achieves 122Gbp/day throughput by exploiting shared-cache parallelism, which leads to a 2-fold performance improvement on an 8-core Intel Xeon processor. Second, we develop an FPGA coprocessor that leverages both bit-level and word-level parallelism with a scatter-gather memory mechanism to speed up inherently irregular memory access operations by increasing effective memory bandwidth. Our customized FPGA coprocessor achieves 947Gbp/day throughput, which is 189 times higher than current mapping tools on a single CPU core, and more than 2 times higher than a 64-core multi-processor system. The coprocessor's power efficiency is 29 times higher than that of a conventional 64-core multi-processor. The results indicate that the customized FPGA coprocessor architecture, configured with word-level access to scatter-gather memory, is well suited to data-intensive applications.
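For readers unfamiliar with hash-index read mapping, the basic lookup that such accelerators speed up can be sketched as follows. This is a generic seed-and-verify illustration, not the paper's bucket reordering algorithm or its FPGA design; the function names, the seed length `k` and the mismatch limit are all assumptions.

```python
from collections import defaultdict


def build_index(reference: str, k: int):
    """Hash every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index, k: int, max_mismatches: int = 1):
    """Look up the read's first k bases, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # candidate runs off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTACGTTAGC"
index = build_index(reference, k=4)
hits = map_read("ACGTTA", reference, index, k=4)
```

Each read touches a hash bucket chosen by its seed, so consecutive reads jump to unrelated memory locations. That irregular access pattern is the behaviour the abstract's cache-aware reordering and scatter-gather memory are aimed at.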

Citations: 0

Big data genome sequencing on Zynq based clusters (abstract only)

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554694

Chao Wang, Xi Li, Xuehai Zhou, Yunji Chen, R. Cheung

Next-generation sequencing (NGS) problems have attracted much attention from researchers in the biological and medical computing domains. The current state-of-the-art NGS computing machines are dramatically lowering the cost and increasing the throughput of DNA sequencing. In this paper, we propose a practical study that uses a Xilinx Zynq board to summarize acceleration engines using FPGA accelerators and ARM processors for state-of-the-art short read mapping approaches. The heterogeneous processors and accelerators are coupled with each other using a general Hadoop distributed processing framework. First the reads are collected by the central server, and then they are distributed to multiple accelerators on the Zynq for hardware acceleration. The combination of hardware acceleration and the Map-Reduce execution flow can therefore greatly accelerate the task of aligning short reads to a known reference genome. Our approach is based on preprocessing the reference genomes and on iterative jobs for aligning the continuously incoming reads. The hardware acceleration is based on the creditable RMAP read-mapping software approach. Furthermore, a speedup analysis is carried out on a Hadoop cluster that comprises 8 development boards. Experimental results demonstrate that our proposed architecture and methods achieve a speedup of more than 112X and scale with the number of accelerators. Finally, the Zynq based cluster has the potential to efficiently accelerate even general large-scale big data applications. This work was supported by NSFC grants No. 61379040, No. 61272131 and No. 61202053.

Citations: 8

OmpSs@Zynq all-programmable SoC ecosystem

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554777

Antonio Filgueras, E. Gil, Daniel Jiménez-González, C. Álvarez, X. Martorell, Jan Langer, Juanjo Noguera, K. Vissers

OmpSs is an OpenMP-like, directive-based programming model that supports heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies. Indeed, OmpSs has largely influenced the recently released OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA and benefits from DLP, ILP and TLP parallelism to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.

Citations: 28

Theory and algorithm for generalized memory partitioning in high-level synthesis

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554780

Yuxin Wang, Peng Li, J. Cong

The significant development of high-level synthesis tools has greatly facilitated the use of FPGAs as general computing platforms. During parallelism optimization of the data path, memory becomes a crucial bottleneck that impedes performance enhancement. Simultaneous data access is highly restricted by the data mapping strategy and by memory port constraints. Memory partitioning can efficiently map data elements of the same logical array onto multiple physical banks so that accesses to the array are parallelized. Previous methods for memory partitioning mainly focused on cyclic partitioning for single-port memory. In this work we propose a generalized memory-partitioning framework to provide high data throughput from on-chip memories. We generalize cyclic partitioning into block-cyclic partitioning to enable a larger design space exploration. We build the conflict detection algorithm on polytope emptiness testing, and use integer point counting in polytopes for intra-bank offset generation. Memory partitioning for multi-port memory is supported in this framework. Experimental results demonstrate that, compared to the state-of-the-art partitioning algorithm, our proposed algorithm can reduce the number of block RAMs by 19.58%, slices by 20.26% and DSPs by 50%.
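
The difference between cyclic and block-cyclic partitioning can be illustrated with a minimal one-dimensional sketch. It is not the authors' implementation: the function names (`bank`, `offset`, `conflict_free`), the `trip_count` bound, and the access pattern are hypothetical, the offset formula is the standard block-cyclic layout rather than the paper's polytope-based generator, and conflicts are found by brute-force enumeration instead of polytope emptiness testing. Single-port banks are assumed.

```python
def bank(x, n_banks, block):
    """Bank holding array element x; block == 1 gives plain cyclic partitioning."""
    return (x // block) % n_banks


def offset(x, n_banks, block):
    """Intra-bank address of element x under the standard block-cyclic layout."""
    return (x // (n_banks * block)) * block + x % block


def conflict_free(pattern, n_banks, block, trip_count=64):
    """Check, by enumeration, that the accesses i + d (d in pattern) issued in
    the same cycle always fall in distinct banks (one port per bank assumed)."""
    for i in range(trip_count):
        banks = [bank(i + d, n_banks, block) for d in pattern]
        if len(set(banks)) != len(banks):
            return False
    return True


if __name__ == "__main__":
    pattern = (0, 2)  # hypothetical loop body reading a[i] and a[i + 2]
    print("cyclic, 2 banks:", conflict_free(pattern, n_banks=2, block=1))
    print("block-cyclic, 2 banks, block 2:", conflict_free(pattern, n_banks=2, block=2))
```

With two banks, plain cyclic partitioning places `a[i]` and `a[i + 2]` in the same bank on every iteration, whereas a block size of 2 separates them. This is the kind of additional design point that widening the search from cyclic to block-cyclic schemes exposes.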

Citations: 73

Quantifying the cost and benefit of latency insensitive communication on FPGAs

Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays

Pub Date : 2014-02-26 DOI: 10.1145/2554688.2554786

Kevin E. Murray, Vaughn Betz

Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency insensitive systems.

Citations: 12