
Latest publications: 2013 International Conference on Field-Programmable Technology (FPT)

Reconfigurable filtered acceleration of short read alignment
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718408
James Arram, W. Luk, P. Jiang
Recent trends in the cost and demand of next-generation DNA sequencing (NGS) have revealed a great computational challenge in analysing the massive quantities of sequenced data produced. Given that the projected increase in sequenced data far outstrips Moore's Law, the current technologies used to handle the data are likely to become insufficient. This paper explores the use of reconfigurable hardware in accelerating short read alignment, in which the positions of millions of short DNA sequences (called reads) are located in a known reference genome. This work proposes a new general approach for accelerating suffix-trie based short read alignment methods using reconfigurable hardware. In the proposed approach, specialised filters are designed to align short reads to a reference genome within a specific edit distance. The filters are arranged in a pipeline in order of increasing edit distance: short reads that cannot be aligned by a given filter are forwarded to the next filter in the pipeline for further processing. Run-time reconfiguration is used to fully populate an accelerator device with each filter in the pipeline in turn. In our implementation, a single FPGA is populated with specialised filters based on a novel bidirectional backtracking version of the FM-index; in this particular implementation the alignment time can be up to 14.7 and 18.1 times faster than SOAP2 and BWA running on dual Intel X5650 CPUs.
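The filter-pipeline idea in this abstract can be illustrated with a small software sketch (purely illustrative Python: a brute-force scan with Hamming distance stands in for the paper's FM-index-based edit-distance filters, and `align_with_filters` and its arguments are hypothetical names, not the authors' code):

```python
def align_with_filters(reads, reference, max_edits=2):
    """Software analogue of a filter pipeline: each 'filter' handles one
    edit distance; reads it cannot align fall through to the next filter."""
    def positions_within(read, ref, k):
        # Brute-force scan: positions where the read matches the reference
        # within Hamming distance k (a stand-in for full edit distance).
        hits = []
        for i in range(len(ref) - len(read) + 1):
            mismatches = sum(a != b for a, b in zip(read, ref[i:i + len(read)]))
            if mismatches <= k:
                hits.append(i)
        return hits

    aligned, pending = {}, list(reads)
    for k in range(max_edits + 1):          # filters in order of edit distance
        still_pending = []
        for read in pending:
            hits = positions_within(read, reference, k)
            if hits:
                aligned[read] = (k, hits)   # aligned by filter k
            else:
                still_pending.append(read)  # forward to the next filter
        pending = still_pending
    return aligned, pending
```

Reads that fall through every filter are reported as unaligned, mirroring how the hardware pipeline forwards unmatched reads from filter to filter.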
Citations: 10
Maximum flow algorithms for maximum observability during FPGA debug
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718324
Eddie Hung, Al-Shahna Jamal, S. Wilton
Due to the ever-increasing density and complexity of integrated circuits, FPGA prototyping has become a necessary part of the design process. To enhance observability into these devices, designers commonly insert trace-buffers to record and expose the values on a small subset of internal signals during live operation, helping to root-cause errors. For dense designs, routing congestion restricts the number of signals that can be connected to these trace-buffers. In this work, we apply optimal network-flow graph algorithms, a well-studied technique, to the problem of transporting circuit signals to embedded trace-buffers for observation. Specifically, we apply a minimum-cost maximum-flow algorithm to gain maximum signal observability with minimum total wirelength. We showcase our techniques both on theoretical FPGA architectures using VPR and on a Xilinx Virtex-6 device, finding that for the latter, over 99.6% of all spare RAM inputs can be reclaimed for tracing across four large benchmarks.
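The signal-to-trace-buffer assignment can be sketched as a minimum-cost maximum-flow problem (an illustrative Python solver using successive shortest augmenting paths; the toy graph in the test below is made up for illustration, not the paper's FPGA routing graph):

```python
def min_cost_max_flow(n, edges, s, t):
    """edges: list of (u, v, capacity, cost). Returns (max_flow, total_cost).
    Successive shortest augmenting paths found with Bellman-Ford, which
    tolerates the negative-cost residual edges without potentials."""
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])  # residual edge
    flow = total_cost = 0
    while True:
        dist = [float("inf")] * n
        dist[s] = 0
        prev = [None] * n
        updated = True
        while updated:                      # Bellman-Ford relaxation
            updated = False
            for u in range(n):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        updated = True
        if dist[t] == float("inf"):
            return flow, total_cost         # no augmenting path remains
        push, v = float("inf"), t           # bottleneck capacity on the path
        while v != s:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = t
        while v != s:                       # apply the flow along the path
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        flow += push
        total_cost += push * dist[t]
```

In the flow formulation, a source feeds each candidate signal with unit capacity, signal-to-pin edges carry wirelength costs, and each trace-buffer pin drains unit capacity into the sink; maximum flow maximizes observed signals while minimum cost minimizes total wirelength.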
Citations: 10
SOAP: Structural optimization of arithmetic expressions for high-level synthesis
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718340
Xitong Gao, Samuel Bayliss, G. Constantinides
This paper introduces SOAP, a new tool that automatically optimizes the structure of arithmetic expressions for FPGA implementation as part of a high-level synthesis flow, taking into account axiomatic rules derived from real arithmetic, such as distributivity and associativity. We explicitly target an optimized area/accuracy trade-off, allowing arithmetic expressions to be automatically rewritten for this purpose. For the first time, we bring rigorous approaches from software static analysis, specifically formal semantics and abstract interpretation, to bear on source-to-source transformation for high-level synthesis. New abstract semantics are developed to generate a computable subset of equivalent expressions from an original expression. Using formal semantics, we calculate two objectives: the accuracy of computation and an estimate of FPGA resource utilization. Optimizing these objectives produces a Pareto frontier consisting of a set of expressions, giving the synthesis tool the flexibility to choose an implementation satisfying constraints on both accuracy and resource usage. We thus go beyond existing literature by not only optimizing the precision requirements of an implementation, but changing the structure of the implementation itself. Using our tool to optimize the structure of a variety of real-world and artificially generated examples in single precision, we improve either their accuracy or their resource utilization by up to 60%.
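The accuracy side of the trade-off can be illustrated in software (a minimal Python sketch, not the SOAP tool itself: it enumerates summation orders permitted by associativity and commutativity and scores each against a double-precision reference):

```python
import struct
from itertools import permutations

def f32(x):
    # Round a Python double to IEEE-754 single precision.
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    # Left-to-right summation, every intermediate rounded to single precision.
    acc = f32(0.0)
    for v in values:
        acc = f32(acc + f32(v))
    return acc

def best_ordering(values):
    # Enumerate the (small) space of equivalent summation orders and pick
    # the one whose single-precision result is closest to the
    # double-precision reference.
    reference = sum(values)
    best = min(permutations(values),
               key=lambda p: abs(sum_f32(p) - reference))
    return list(best), sum_f32(best)
```

For catastrophic-cancellation inputs such as `[1e8, 1.0, -1e8]`, the naive left-to-right order loses the small addend entirely, while a rewritten order recovers the exact result, which is precisely the kind of structural choice the tool explores.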
Citations: 18
A hardware implementation of Bag of Words and Simhash for image recognition
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718403
Shengye Wang, Chen Liang, Xuegong Zhou, Wei Cao, Chen-Mie Wu, Xitian Fan, Lingli Wang
Algorithms such as Bag of Words and Simhash have been widely used in image recognition. To achieve better performance as well as energy efficiency, a hardware implementation of these two algorithms is proposed in this paper. To the best of our knowledge, this is the first time these algorithms have been implemented in hardware for image recognition purposes. The proposed implementation is able to generate a fingerprint of an image and accurately find the closest match in the database. It is implemented on Xilinx's Virtex-6 SX475T FPGA. Trade-offs between high performance and low hardware overhead are obtained through proper parallelization. Experimental results show that the proposed implementation can process 1,018 images per second, approximately 17.8x faster than software on Intel's 12-thread Xeon X5650 processor. On the other hand, the power consumption is 0.35x that of the software-based implementation, so the overall advantage in energy efficiency is as much as 46x. The proposed architecture is scalable, and is able to meet various requirements of image recognition.
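The Simhash fingerprinting step can be sketched in a few lines of Python (an illustrative software version; the paper computes this in hardware, and the MD5-based feature hash here is an assumption made for the sketch):

```python
import hashlib

def simhash(features, bits=64):
    # Simhash: each feature casts a +1/-1 vote per bit position; the sign
    # of the accumulated vote becomes that bit of the fingerprint, so
    # similar feature sets yield fingerprints with small Hamming distance.
    vote = [0] * bits
    for feat in features:
        h = int.from_bytes(hashlib.md5(feat.encode()).digest()[:8], "big")
        for i in range(bits):
            vote[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if vote[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")
```

Matching then reduces to finding the database fingerprint with the smallest Hamming distance to the query image's fingerprint.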
Citations: 0
FlexGrip: A soft GPGPU for FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718358
K. Andryc, Murtaza Merchant, R. Tessier
Over the past decade, soft microprocessors and vector processors have been extensively used in FPGAs for a wide variety of applications. However, it is difficult to straightforwardly extend their functionality to support the conditional and thread-based execution characteristic of general-purpose graphics processing units (GPGPUs) without recompiling FPGA hardware for each application. In this paper, we describe the implementation of FlexGrip, a soft GPGPU architecture which has been optimized for FPGA implementation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU without hardware recompilation. Our architecture is customizable, providing the FPGA designer with a selection of GPGPU cores which exhibit performance versus area trade-offs. The benefits of our architecture are evaluated for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of up to 30× versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip.
Citations: 77
From C to Blokus Duo with LegUp high-level synthesis
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718424
J. Cai, Ruolong Lian, Mengyao Wang, Andrew Canis, Jongsok Choi, B. Fort, Eric Hart, Emily Miao, Yanyan Zhang, Nazanin Calagar, S. Brown, J. Anderson
We apply high-level synthesis (HLS) to generate Blokus Duo game-playing hardware for the FPT 2013 Design Competition [3]. Our design, written in C, is synthesized to Verilog using the LegUp open-source HLS tool, then mapped using vendor tools to an Altera Cyclone IV FPGA on a DE2 board. Our software implementation is designed to be amenable to high-level synthesis: it includes a custom stack implementation, uses only integer arithmetic, and employs bitwise logical operations to improve overall computational performance. The underlying AI decision making is based on alpha-beta pruning [2]. The performance of our synthesizable solution is gauged by playing against Pentobi [8], a "known good" C++ software implementation.
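Alpha-beta pruning, the basis of the AI described above, can be sketched over a toy game tree (illustrative Python, not the team's synthesizable C source):

```python
def alphabeta(node, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a nested-list game tree:
    leaves are integer scores, internal nodes are lists of children."""
    if isinstance(node, int):
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cutoff: the opponent will avoid this branch
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:
                break  # alpha cutoff: we will never choose this branch
        return value
```

The cutoffs skip subtrees that cannot affect the final decision, which is what makes a fixed-depth game-tree search tractable in hardware time budgets.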
Citations: 6
Maximizing speed and density of tiled FPGA overlays via partitioning
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718360
Charles Eric LaForest, J. Gregory Steffan
Common practice for large FPGA design projects is to divide sub-projects into separate synthesis partitions to allow incremental recompilation as each sub-project evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components and hence having multi-localities (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than ±10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micro-managing of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51 and 54%) of a mesh of soft processors, across many building block configurations and mesh geometries; (iv) the improvements from partitioning increase as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 to 437 MHz and its peak performance from 28,968 to 44,574 MIPS, while increasing its logic area by only 0.85%.
Citations: 7
Bitwidth-optimized hardware accelerators with software fallback
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718343
Ana Klimovic, J. Anderson
We propose the high-level synthesis of an FPGA-based hybrid computing system, where the implementations of compute-intensive functions are available both in software and as hardware accelerators. The accelerators are optimized to handle common-case inputs, as opposed to worst-case inputs, allowing accelerator area to be reduced by 28% on average, while retaining the majority of the performance advantages of a hardware versus software implementation. When inputs exceed the range that the hardware accelerators can handle, a software fallback is automatically triggered. Optimization of the accelerator area is achieved by reducing datapath widths based on application profiling of variable ranges in software (under typical datasets). The selected widths are passed to a high-level synthesis tool which generates the accelerator for a given function. The optimized accelerators with software-fallback capability are generated automatically by our framework, with minimal user intervention. Our study explores the trade-offs of delay and area for benchmarks implemented on an Altera Cyclone II FPGA.
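The common-case dispatch with software fallback can be sketched in Python (illustrative only; the `sum_sq` function and the 16-bit profiled range are assumptions for the sketch, not the paper's benchmarks or profiler output):

```python
# Profiled common-case input range (assumed 16-bit signed for illustration):
# the narrow-datapath accelerator is only valid for inputs in this range.
HW_MIN, HW_MAX = -(1 << 15), (1 << 15) - 1

def hw_accel_sum_sq(xs):
    # Stand-in for the bitwidth-optimized hardware accelerator path.
    return sum(x * x for x in xs)

def sw_fallback_sum_sq(xs):
    # Full-precision software path, correct for any input.
    return sum(x * x for x in xs)

def sum_sq(xs):
    # Dispatch: use the accelerator for common-case inputs; automatically
    # fall back to software when any input exceeds the profiled range.
    if all(HW_MIN <= x <= HW_MAX for x in xs):
        return hw_accel_sum_sq(xs)
    return sw_fallback_sum_sq(xs)
```

The range guard is the software analogue of the trigger described above: typical inputs take the fast narrow path, while rare out-of-range inputs stay correct at the cost of the software path's speed.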
Citations: 8
Accelerating iterative algorithms with asynchronous accumulative updates on FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718332
D. Unnikrishnan, S. Virupaksha, Lekshmi Krishnan, Lixin Gao, R. Tessier
Iterative algorithms represent a pervasive class of data mining, web search and scientific computing applications. In iterative algorithms, a final result is derived by performing repetitive computations on an input data set. Existing techniques to parallelize such algorithms typically use software frameworks such as MapReduce and Hadoop to distribute data for an iteration across multiple CPU-based workstations in a cluster and collect per-iteration results. These platforms are marked by the need to synchronize data computations at iteration boundaries, impeding system performance. In this paper, we demonstrate that FPGAs in distributed computing systems can serve a vital role in breaking this synchronization barrier with the help of asynchronous accumulative updates. These updates allow intermediate results to accumulate for numerous data points without iteration-based barriers, allowing individual nodes in a cluster to independently make progress towards the final outcome. Computation is dynamically prioritized to accelerate algorithm convergence. A general class of iterative algorithms has been implemented on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU, and the system offers up to 154× speedup versus a standard Hadoop-based CPU workstation.
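The accumulative-update scheme can be sketched with a delta-based PageRank in Python (illustrative software analogue; nodes are processed in arbitrary order with no per-iteration barrier, mirroring the barrier-free progress described above):

```python
def accumulative_pagerank(graph, damping=0.85, tol=1e-10):
    """Delta-based 'accumulative update' PageRank: each node keeps a rank
    and a pending delta; any node with pending work may be processed at
    any time, propagating its delta to successors with no iteration barrier."""
    rank = {v: 0.0 for v in graph}
    delta = {v: 1.0 - damping for v in graph}   # initial contribution
    work = set(graph)                           # nodes with pending deltas
    while work:
        v = work.pop()                          # arbitrary order: no barrier
        d, delta[v] = delta[v], 0.0
        rank[v] += d                            # accumulate into the result
        successors = graph[v]
        if successors and d > tol:              # propagate significant deltas
            share = damping * d / len(successors)
            for u in successors:
                delta[u] += share
                work.add(u)
    return rank
```

Because ranks only ever accumulate, any processing order converges to the same fixed point, which is what lets cluster nodes (or FPGA processing elements) make independent progress.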
{"title":"Accelerating iterative algorithms with asynchronous accumulative updates on FPGAs","authors":"D. Unnikrishnan, S. Virupaksha, Lekshmi Krishnan, Lixin Gao, R. Tessier","doi":"10.1109/FPT.2013.6718332","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718332","url":null,"abstract":"Iterative algorithms represent a pervasive class of data mining, web search and scientific computing applications. In iterative algorithms, a final result is derived by performing repetitive computations on an input data set. Existing techniques to parallelize such algorithms typically use software frameworks such as MapReduce and Hadoop to distribute data for an iteration across multiple CPU-based workstations in a cluster and collect per-iteration results. These platforms are marked by the need to synchronize data computations at iteration boundaries, impeding system performance. In this paper, we demonstrate that FPGAs in distributed computing systems can serve a vital role in breaking this synchronization barrier with the help of asynchronous accumulative updates. These updates allow for the accumulation of intermediate results for numerous data points without the need for iteration-based barriers allowing individual nodes in a cluster to independently make progress towards the final outcome. Computation is dynamically prioritized to accelerate algorithm convergence. A general-class of iterative algorithms have been implemented on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU. The system offers up to 154× speedup versus a standard Hadoop-based CPU-workstation. 
Improved performance is achieved by clusters of FPGAs.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121053841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
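The barrier-free accumulative-update scheme the abstract describes can be illustrated with a delta-based PageRank on a toy graph. The graph, the damping factor, and the priority-queue scheduling below are illustrative assumptions for the sketch, not the paper's FPGA implementation.

```python
import heapq

# Hedged sketch of asynchronous accumulative updates: delta-based PageRank
# on a toy 3-cycle graph. Graph, damping factor, and scheduling policy are
# illustrative assumptions, not the paper's FPGA design.
graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
D = 0.85                                  # damping factor (assumed)
rank = {v: 0.0 for v in graph}
delta = {v: 1.0 - D for v in graph}       # accumulated, not-yet-propagated mass

# A priority queue processes the largest pending delta first, mirroring the
# dynamic prioritisation of computation mentioned in the abstract.
pq = [(-d, v) for v, d in delta.items()]
heapq.heapify(pq)
while pq:
    _, v = heapq.heappop(pq)
    d = delta[v]
    if d < 1e-12:                         # nothing (or too little) pending
        continue
    delta[v] = 0.0
    rank[v] += d                          # fold the accumulated delta in
    share = D * d / len(graph[v])
    for u in graph[v]:                    # propagate with no iteration barrier
        delta[u] += share
        heapq.heappush(pq, (-delta[u], u))

print({v: round(r, 3) for v, r in rank.items()})
```

Because each vertex folds in whatever delta has accumulated whenever it is scheduled, no global iteration boundary is needed; the ranks still converge to the same fixed point a synchronous iteration would reach.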
Automated multi-device placement, I/O voltage supply assignment, and pin assignment in circuit board design
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718363
D. Seemuth, Katherine Morrow
Embedded systems often contain many components, some with multiple Field Programmable Gate Arrays (FPGAs). Designing Printed Circuit Boards (PCBs) for these systems can be a complex process that is often tedious, error-prone, and time-intensive. Existing computer-aided design tools require designers to manually insert components and explicitly define the connections between every component on the PCB - a cumbersome process. A fast PCB design framework requiring reduced designer time and effort would be particularly advantageous for rapid prototyping and short production run PCBs. Therefore, this paper proposes a novel, freely-available open-source framework to capture design intent and automatically implement the design details. Designers express connectivity at a higher level of abstraction than enumerating or drawing each individual trace between components. Given the components and connection requirements, the proposed framework automatically generates component placements, I/O voltage supply assignments, and FPGA pin assignments to minimize trace length. We also propose a novel method to improve trace length estimations during placement, before FPGA pins have actually been assigned to those connections. The proposed framework quickly explores large solution spaces, enabling rapid prototyping and design space exploration, and can lead to lower costs in design time and other non-recurring expenses. We demonstrate that it produces favorable results for various design requirements, which suggests the framework will be especially appreciated by designers of systems with multiple FPGAs having large numbers of flexible pins.
{"title":"Automated multi-device placement, I/O voltage supply assignment, and pin assignment in circuit board design","authors":"D. Seemuth, Katherine Morrow","doi":"10.1109/FPT.2013.6718363","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718363","url":null,"abstract":"Embedded systems often contain many components, some with multiple Field Programmable Gate Arrays (FPGAs). Designing Printed Circuit Boards (PCBs) for these systems can be a complex process that is often tedious, error-prone, and time-intensive. Existing computer-aided design tools require designers to manually insert components and explicitly define the connections between every component on the PCB - a cumbersome process. A fast PCB design framework requiring reduced designer time and effort would be particularly advantageous for rapid prototyping and short production run PCBs. Therefore, this paper proposes a novel, freely-available open-source framework to capture design intent and automatically implement the design details. Designers express connectivity at a higher level of abstraction than enumerating or drawing each individual trace between components. Given the components and connection requirements, the proposed framework automatically generates component placements, I/O voltage supply assignments, and FPGA pin assignments to minimize trace length. We also propose a novel method to improve trace length estimations during placement, before FPGA pins have actually been assigned to those connections. The proposed framework quickly explores large solution spaces, enabling rapid prototyping and design space exploration, and can lead to lower costs in design time and other non-recurring expenses. 
We demonstrate that it produces favorable results for various design requirements, which suggests the framework will be especially appreciated by designers of systems with multiple FPGAs having large numbers of flexible pins.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126986581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
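The trace-length-driven placement step can be illustrated with a toy search that minimises half-perimeter wirelength (HPWL), a standard trace-length estimate usable before pins are actually assigned. The components, slots, and nets below are hypothetical, and the exhaustive search merely stands in for the paper's actual placement algorithm.

```python
from itertools import permutations

# Toy placement sketch (not the paper's framework): try every assignment of
# three components to four board slots and keep the placement with the
# smallest total half-perimeter wirelength (HPWL).
slots = [(0, 0), (0, 2), (2, 0), (2, 2)]   # candidate board positions (assumed)
nets = [("cpu", "fpga"), ("fpga", "ram")]  # required connections (assumed)
components = ["cpu", "fpga", "ram"]

def hpwl(placement):
    """Sum of bounding-box (Manhattan) spans over all nets."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

best = min(
    (dict(zip(components, p)) for p in permutations(slots, len(components))),
    key=hpwl,
)
print(best, hpwl(best))
```

Real placers replace the exhaustive loop with heuristics such as simulated annealing, but the objective — an estimated trace length computed before pin assignment — is the same idea the abstract's improved estimation method feeds into.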