
Latest Publications: 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Loop Splitting for Efficient Pipelining in High-Level Synthesis
Junyi Liu, John Wickerson, G. Constantinides
Loop pipelining is widely adopted as a key optimization method in high-level synthesis (HLS). However, when complex memory dependencies appear in a loop, commercial HLS tools are still not able to maximize pipeline performance. In this paper, we leverage parametric polyhedral analysis to reason about memory dependence patterns that are uncertain (i.e., parameterised by an undetermined variable) and/or non-uniform (i.e., varying between loop iterations). We develop an automated source-to-source code transformation to split the loop into pieces, which are then synthesised by Vivado HLS as the hardware generation back-end. Our technique allows generated loops to run with a minimal interval, automatically inserting statically-determined parametric pipeline breaks at those iterations violating dependencies. Our experiments on seven representative benchmarks show that, compared to default loop pipelining, our parametric loop splitting improves pipeline performance by 4.3× in terms of clock cycles per iteration. The optimized pipelines consume 2.0× as many LUTs, 1.8× as many registers, and 1.1× as many DSP blocks. Hence the area-time product is improved by nearly a factor of 2.
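As a rough, hypothetical illustration of the kind of source-to-source transformation described above (not the authors' exact algorithm), the C sketch below splits a loop whose dependence distance is a runtime parameter m into chunks of m iterations; within a chunk no iteration reads a value written by the same chunk, so the inner loop can be pipelined at II = 1, and the chunk boundary plays the role of the statically determined pipeline break. The array size, the computation, and the pragma placement are illustrative only.

```c
#define N 1024

/* Original form, with a parametric dependence of distance m:
 *     for (int i = 0; i < N - m; i++)
 *         A[i + m] = A[i] + 3;
 * Split form: chunks of m iterations are internally dependence-free. */
void split_loop(int A[N], int m)
{
    if (m <= 0)
        return;                                   /* guard the runtime parameter */
    for (int base = 0; base < N - m; base += m) { /* outer loop walks the chunks */
        int limit = (N - m) - base;
        if (limit > m)
            limit = m;                            /* last chunk may be shorter   */
        for (int j = 0; j < limit; j++) {
#pragma HLS pipeline II=1
            int i = base + j;
            /* A[i] was written m iterations earlier, i.e. in an earlier
             * chunk, so iterations of this inner loop are independent. */
            A[i + m] = A[i] + 3;
        }
    }
}
```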
Citations: 29
SynADT: Dynamic Data Structures in High Level Synthesis
Zeping Xue, David B. Thomas
Abstract Data Types (ADTs) such as dictionaries and lists are essential for many embedded computing applications such as network stacks. However, in heterogeneous systems, code using ADTs can usually only run in CPUs, because components written in HLS do not support dynamic data structures. HLS tools cannot be used to synthesise dynamic data structures directly because the use of pointers is very restricted; for example, pointers to pointers and pointer casting are not supported. Consequently, it is unclear what the API should look like and how to express dynamic data structures in HLS so that the tools can compile them. We propose SynADT, which consists of a methodology and a benchmark. The methodology provides classic data structures (linked lists, binary trees, hash tables and vectors) using relative addresses instead of pointers in Vivado HLS. The benchmark can be used to evaluate the performance of data structures in HLS, ARM processors and soft processors such as MicroBlaze; CPUs can utilise either the default C memory allocator or a hardware memory allocator. We evaluate the data structures on a Zynq FPGA, demonstrating scaling to approximately 10MB memory usage and 1M data items. With a workload that utilises 10MB memory, the HLS data structures operating at 150MHz are on average 1.35× faster than MicroBlaze data structures operating at 150MHz with the default C allocator and 7.97× slower than ARM processor data structures operating at 667MHz with the default C allocator.
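To make "relative addresses instead of pointers" concrete, here is a minimal C sketch in that style, with indices into a fixed node pool standing in for pointers so that an HLS tool can synthesise the structure. The type names, pool size, and bump allocator are invented for the example and are not the SynADT API; node reuse would additionally need a free list, which is omitted.

```c
#include <stdint.h>

#define POOL_SIZE 1024
#define NIL ((uint16_t)0xFFFF)        /* "null" relative address             */

typedef struct {
    int32_t  value;
    uint16_t next;                    /* relative address of the next node   */
} node_t;

typedef struct {
    node_t   pool[POOL_SIZE];         /* backing store, mappable to BRAM/DDR */
    uint16_t head;
    uint16_t free_top;                /* trivial bump allocator over the pool */
} list_t;

static void list_init(list_t *l)
{
    l->head = NIL;
    l->free_top = 0;
}

/* Push a value at the head; returns 0 on success, -1 if the pool is full. */
static int list_push(list_t *l, int32_t v)
{
    if (l->free_top >= POOL_SIZE)
        return -1;
    uint16_t idx = l->free_top++;
    l->pool[idx].value = v;
    l->pool[idx].next  = l->head;
    l->head = idx;
    return 0;
}

/* Traverse by chasing relative addresses instead of pointers. */
static int32_t list_sum(const list_t *l)
{
    int32_t sum = 0;
    for (uint16_t idx = l->head; idx != NIL; idx = l->pool[idx].next)
        sum += l->pool[idx].value;
    return sum;
}
```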
Citations: 10
AutoSLIDE: Automatic Source-Level Instrumentation and Debugging for HLS
Liwei Yang, S. Gurumani, Deming Chen, K. Rupnow
Improved quality of results from high level synthesis (HLS) tools has led to their increased adoption in hardware design. However, functional verification of HLS-produced designs remains a major challenge. Once a bug is exposed, designers must backtrace thousands of signals and simulation cycles to determine the underlying cause. The challenge is further exacerbated with HLS-produced non-human-readable RTL. In this paper, we present AutoSLIDE, an automated cross-layer verification framework that instruments critical operations, detects discrepancies between software and hardware execution, and traces the suspect datapath tree to identify the bug source for the detected discrepancy. AutoSLIDE also maintains mappings between RTL datapath operations, LLVM-IR operations, and C/C++ source code to precisely pinpoint the root cause of bugs to the exact line/operation in source code, substantially reducing user effort to localize bugs. We demonstrate the effectiveness by detecting and localizing bugs from former versions of the CHStone benchmark suite. Furthermore, we demonstrate the efficiency of AutoSLIDE, with low overhead in HLS time (27%), software trace gathering (10%), and significantly reduced trace size and simulation time compared to exhaustive instrumentation.
Citations: 10
fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs
Stylianos I. Venieris, C. Bouganis
Convolutional Neural Networks (ConvNets) are a powerful Deep Learning model, providing state-of-the-art accuracy to many emerging classification problems. However, ConvNet classification is a computationally heavy task, suffering from rapid complexity scaling. This paper presents fpgaConvNet, a novel domain-specific modelling framework together with an automated design methodology for the mapping of ConvNets onto reconfigurable FPGA-based platforms. By interpreting ConvNet classification as a streaming application, the proposed framework employs the Synchronous Dataflow (SDF) model of computation as its basis and proposes a set of transformations on the SDF graph that explore the performance-resource design space, while taking into account platform-specific resource constraints. A comparison with existing ConvNet FPGA works shows that the proposed fully-automated methodology yields hardware designs that improve the performance density by up to 1.62× and reach up to 90.75% of the raw performance of architectures that are hand-tuned for particular ConvNets.
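As background for the SDF basis mentioned above, the sketch below computes the repetition vector of a chain-shaped synchronous dataflow graph by solving the balance equations r[e]·prod[e] = r[e+1]·cons[e]. This is textbook SDF rather than the fpgaConvNet framework itself, and the three-stage example rates are invented.

```c
#include <stdio.h>

/* Repetition vector of a chain-shaped SDF graph with n actors and n-1
 * edges: on edge e, actor e produces prod[e] tokens per firing and actor
 * e+1 consumes cons[e] tokens per firing. The repetition vector r[] is
 * the smallest positive integer solution of the balance equations. */

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

static void repetition_vector(const long *prod, const long *cons,
                              long *r, int n)
{
    r[0] = 1;
    for (int e = 0; e < n - 1; e++) {
        long lhs = r[e] * prod[e];
        long g = gcd(lhs, cons[e]);
        long scale = cons[e] / g;          /* keep everything integral */
        for (int j = 0; j <= e; j++)
            r[j] *= scale;
        r[e + 1] = (lhs * scale) / cons[e];
    }
    long g = r[0];                          /* normalise to the minimum */
    for (int i = 1; i < n; i++) g = gcd(g, r[i]);
    for (int i = 0; i < n; i++) r[i] /= g;
}

int main(void)
{
    /* Invented 3-stage streaming pipeline (e.g. conv -> pool -> fc). */
    long prod[2] = { 4, 1 };   /* tokens produced on edges 0 and 1 */
    long cons[2] = { 2, 8 };   /* tokens consumed on edges 0 and 1 */
    long r[3];
    repetition_vector(prod, cons, r, 3);
    printf("repetition vector: %ld %ld %ld\n", r[0], r[1], r[2]);
    return 0;   /* prints: repetition vector: 4 8 1 */
}
```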
Citations: 215
GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Accelerator
J. Gray
GRVI is an FPGA-efficient RISC-V RV32I soft processor. Phalanx is a parallel processor and accelerator array framework. Groups of processors and accelerators form shared memory clusters. Clusters are interconnected with each other and with extreme bandwidth I/O and memory devices by a Hoplite NOC with 300-bit links. An example Kintex UltraScale 040 system has 400 RISC-V cores, peak throughput of 100,000 MIPS, peak shared memory bandwidth of 600 GB/s, NOC bisection bandwidth of 700 Gb/s, and uses 12-17 W.
Citations: 64
Reconfiguration Control Networks for TMR Systems with Module-Based Recovery
D. Agiakatsikas, N. T. H. Nguyen, Zhuoran Zhao, Tong Wu, E. Çetin, O. Diessel, Lingkan Gong
Field-Programmable Gate Arrays (FPGAs) provide ideal platforms for meeting the computational requirements of future space-based processing systems. However, FPGAs are susceptible to radiation-induced Single Event Upsets (SEUs). Techniques for dynamically reconfiguring corrupted modules of Triple Modular Redundant (TMR) components are well known. However, most of these techniques utilize resources that are themselves susceptible to SEUs to transfer reconfiguration requests from the TMR voters to a central reconfiguration controller. This paper evaluates the impact of these Reconfiguration Control Networks (RCNs) on the system's reliability and performance. We provide an overview of RCNs reported in the literature and compare them in terms of dependability, scalability and performance. We implemented our designs on a Xilinx Artix-7 FPGA to assess the resulting resource utilization and performance as well as to evaluate their soft error vulnerability using analytical techniques. We show that of the RCN topologies studied, an ICAP-based approach is the most reliable despite having the highest network latency. We also conclude that a module-based recovery approach is less reliable than scrubbing unless the RCN is triplicated and repaired when it suffers configuration memory errors.
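For readers unfamiliar with the TMR voters the abstract refers to, a bitwise 2-of-3 majority vote, plus the disagreement check whose result would be turned into a reconfiguration request on the RCN, looks roughly as follows. This is an illustrative C sketch (real voters are implemented in FPGA logic/RTL) and the function names are invented.

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority vote over three replica outputs. */
static inline uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (b & c) | (a & c);
}

/* Return the index (0, 1 or 2) of a replica that disagrees with the
 * majority, or -1 if all three agree. In a module-based recovery scheme,
 * a disagreement would be reported over the RCN so the reconfiguration
 * controller can rewrite that module's configuration frames. */
static int tmr_disagreeing_replica(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t m = tmr_vote(a, b, c);
    if (a != m) return 0;
    if (b != m) return 1;
    if (c != m) return 2;
    return -1;
}
```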
Citations: 20
Sectors: Divide & Conquer and Softwarization in the Design and Validation of the Stratix® 10 FPGA
D. How, Sean Atsatt
The Stratix 10 project started with aggressive performance, size, and feature goals, all to be met on a lean schedule. Meeting these performance goals led to a restructuring of the entire configurable clock system into a regular gridded network, which subdivided the device into a composable system of "sectors". Sectors aligned with the needs of the project schedule, since they allowed complexity -- of specification, design, and validation -- to be addressed through "divide and conquer". Similarly, the customary "out-of-band" FPGA management functions including initialization, configuration, test, redundancy, scrubbing, and so on, were reconstituted to run on a collection of per-sector and supervisory processors interconnected by a NoC, whose distributed software would replace centralized tightly coupled finite state machines. This softwarization and parallelization reduced risk, increased flexibility, and increased data bandwidth. During development, parallel teams separately exercised each sector type and its local processor software via the sector's clock and NoC ports, accelerating validation on design databases two orders of magnitude smaller compared to previous methodologies. Even complex features can be added by including new NoC packet types and software rather than painfully adding wires to a rigid floor-plan.
Citations: 6
A Dynamically Scheduled Architecture for the Synthesis of Graph Database Queries
Marco Minutoli, Vito Giovanni Castellana, Antonino Tumeo, Fabrizio Ferrandi, M. Lattuada
Data analytics applications, such as graph databases, exhibit irregular behaviors that make their acceleration non-trivial. These applications expose a significant amount of Task Level Parallelism (TLP), but they present fine-grained memory accesses.
Citations: 0
Application-Aware Collective Communication (Extended Abstract)
Jiayi Sheng, Qingqing Xiong, Chen Yang, M. Herbordt
Preliminary results are presented of hardware support for collective communication that takes advantage of a priori routing information.
Citations: 3
Two-Hit Filter Synthesis for Genomic Database Search
Jordan A. Bradshaw, Rasha Karakchi, J. Bakos
Advancements in genomic sequencing technology are causing genomic database growth to outpace Moore's Law. This continues to make genomic database search a difficult problem and a popular target for emerging processing technologies. The de facto software tool for genomic database search is NCBI BLAST, which operates by transforming each database query into a filter that is subsequently applied to the database. This requires a database scan for every query, fundamentally limiting its performance by I/O bandwidth. In this paper we present a functionally-equivalent variation on the NCBI BLAST algorithm that maps more suitably to an FPGA implementation. This variation of the algorithm attempts to reduce the I/O requirement by leveraging FPGA-specific capabilities, such as high pattern matching throughput and explicit on-chip memory structure and allocation. Our algorithm transforms the database -- not the query -- into a filter that is stored as a hierarchical arrangement of three tables, the first two of which are stored on chip and the third off chip. Our results show that -- while performance is data dependent -- it is possible to achieve speedups of up to 8X based on the relative reduction in I/O of our approach versus that of NCBI BLAST. More importantly, the performance relative to NCBI BLAST improves with larger databases and query workload sizes.
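For context on the "two-hit" idea the title borrows from NCBI BLAST: a word (w-mer) hit only triggers an ungapped extension if a second, non-overlapping hit falls on the same diagonal within a fixed window of it. The C sketch below shows this classic software-side criterion only, not the paper's hierarchical three-table FPGA filter; the window size and the per-diagonal cache layout are assumptions.

```c
/* A word hit pairs a query position with a database (subject) position. */
typedef struct {
    int q_pos;   /* position of the w-mer in the query    */
    int s_pos;   /* position of the w-mer in the database */
} hit_t;

#define TWO_HIT_WINDOW 40   /* assumed window, as in classic BLAST */

/* last_on_diag caches the subject position of the most recent hit on each
 * diagonal; it must have n_diags = (q_len + s_len) entries, all
 * initialised to -1. Returns 1 if h completes a valid two-hit pair. */
static int is_second_hit(int *last_on_diag, int n_diags, int q_len,
                         const hit_t *h, int word_len)
{
    int diag = h->s_pos - h->q_pos + q_len;   /* shift into [0, n_diags) */
    if (diag < 0 || diag >= n_diags)
        return 0;
    int prev = last_on_diag[diag];
    last_on_diag[diag] = h->s_pos;
    if (prev < 0)
        return 0;                             /* first hit on this diagonal */
    int dist = h->s_pos - prev;
    return dist >= word_len && dist <= TWO_HIT_WINDOW;
}
```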
Citations: 2