Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718408
James Arram, W. Luk, P. Jiang
Recent trends in the cost of and demand for next-generation DNA sequencing (NGS) have revealed a great computational challenge: analysing the massive quantities of sequenced data produced. Given that the projected growth in sequenced data far outstrips Moore's Law, the technologies currently used to handle the data are likely to become insufficient. This paper explores the use of reconfigurable hardware to accelerate short read alignment, in which the positions of millions of short DNA sequences (called reads) are located in a known reference genome. This work proposes a new general approach for accelerating suffix-trie-based short read alignment methods using reconfigurable hardware. In this approach, specialised filters are designed to align short reads to a reference genome with a specific edit distance. The filters are arranged in a pipeline in order of increasing edit distance; short reads that cannot be aligned by a given filter are forwarded to the next filter in the pipeline for further processing. Run-time reconfiguration is used to fully populate an accelerator device with each filter in the pipeline in turn. In our implementation, a single FPGA is populated with specialised filters based on a novel bidirectional backtracking version of the FM-index; the alignment time is up to 14.7 and 18.1 times faster than SOAP2 and BWA, respectively, run on dual Intel X5650 CPUs.
Title: Reconfigurable filtered acceleration of short read alignment
Published in: 2013 International Conference on Field-Programmable Technology (FPT)
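The filter pipeline builds on FM-index search. As a rough software illustration of the underlying operation, the sketch below implements a generic exact-match (edit-distance-0) backward search over a toy BWT; it is not the authors' bidirectional backtracking hardware design, and the `bwt_index`/`backward_search` names are our own.

```python
def bwt_index(text):
    """Build a toy FM-index: first-column counts C and occurrence table occ."""
    text += "$"                                    # unique smallest terminator
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    bwt = "".join(text[i - 1] for i in order)      # last column of sorted rotations
    alphabet = sorted(set(text))
    C, total = {}, 0
    for c in alphabet:                             # C[c] = # chars in text < c
        C[c] = total
        total += text.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):                   # occ[c][i] = # of c in bwt[:i]
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ, len(text)

def backward_search(pattern, C, occ, n):
    """Count exact occurrences of pattern: the edit-distance-0 'filter'."""
    lo, hi = 0, n                                  # suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:                               # interval empty: this read would
            return 0                               # be forwarded to the next filter
    return hi - lo
```

In the paper's pipeline, a read rejected here (count 0) would move on to a filter permitting a larger edit distance.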
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718324
Eddie Hung, Al-Shahna Jamal, S. Wilton
Due to the ever-increasing density and complexity of integrated circuits, FPGA prototyping has become a necessary part of the design process. To enhance observability into these devices, designers commonly insert trace-buffers that record and expose the values of a small subset of internal signals during live operation, helping to root-cause errors. For dense designs, routing congestion restricts the number of signals that can be connected to these trace-buffers. In this work, we apply optimal network-flow graph algorithms, a well-studied technique, to the problem of transporting circuit signals to embedded trace-buffers for observation. Specifically, we apply a minimum-cost maximum-flow algorithm to gain maximum signal observability with minimum total wirelength. We showcase our techniques both on theoretical FPGA architectures using VPR and on a Xilinx Virtex-6 device, finding that for the latter, over 99.6% of all spare RAM inputs can be reclaimed for tracing across four large benchmarks.
Title: Maximum flow algorithms for maximum observability during FPGA debug
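The minimum-cost maximum-flow formulation can be sketched in software. The solver below is a standard successive-shortest-path algorithm with an SPFA (Bellman-Ford) inner loop, not the paper's implementation; the node numbering and wirelength costs in the test are invented purely for illustration (source feeds signal nodes, signal nodes connect to trace-buffer input nodes with wirelength as cost, inputs drain to a sink).

```python
class MinCostMaxFlow:
    """Successive-shortest-path min-cost max-flow over a small directed graph."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]   # adjacency lists of edge indices
        self.edges = []                       # flat list: [to, capacity, cost]

    def add_edge(self, u, v, cap, cost):
        # Forward edge and its residual partner sit at indices eid and eid ^ 1.
        self.graph[u].append(len(self.edges)); self.edges.append([v, cap, cost])
        self.graph[v].append(len(self.edges)); self.edges.append([u, 0, -cost])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            # SPFA shortest path on the residual graph (handles negative costs).
            dist = [float("inf")] * self.n
            in_q, prev = [False] * self.n, [-1] * self.n
            dist[s] = 0
            queue = [s]
            while queue:
                u = queue.pop(0)
                in_q[u] = False
                for eid in self.graph[u]:
                    v, cap, c = self.edges[eid]
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = eid
                        if not in_q[v]:
                            in_q[v] = True; queue.append(v)
            if dist[t] == float("inf"):
                return flow, cost             # no augmenting path remains
            # Find the bottleneck capacity along the shortest path, then push.
            push, v = float("inf"), t
            while v != s:
                eid = prev[v]
                push = min(push, self.edges[eid][1])
                v = self.edges[eid ^ 1][0]    # walk back via the reverse edge
            v = t
            while v != s:
                eid = prev[v]
                self.edges[eid][1] -= push
                self.edges[eid ^ 1][1] += push
                v = self.edges[eid ^ 1][0]
            flow += push
            cost += push * dist[t]
```

Maximising flow maximises the number of observed signals; the cost term simultaneously minimises total estimated wirelength, mirroring the paper's objective.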
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718340
Xitong Gao, Samuel Bayliss, G. Constantinides
This paper introduces SOAP, a new tool that automatically optimizes the structure of arithmetic expressions for FPGA implementation as part of a high-level synthesis flow, taking into account axiomatic rules derived from real arithmetic, such as distributivity and associativity. We explicitly target an optimized area/accuracy trade-off, allowing arithmetic expressions to be automatically rewritten for this purpose. For the first time, we bring rigorous approaches from software static analysis, specifically formal semantics and abstract interpretation, to bear on source-to-source transformation for high-level synthesis. New abstract semantics are developed to generate a computable subset of expressions equivalent to an original expression. Using formal semantics, we calculate two objectives: the accuracy of computation and an estimate of FPGA resource utilization. Optimizing these objectives produces a Pareto frontier consisting of a set of expressions, giving the synthesis tool the flexibility to choose an implementation satisfying constraints on both accuracy and resource usage. We thus go beyond the existing literature by not only optimizing the precision requirements of an implementation but changing the structure of the implementation itself. Using our tool to optimize the structure of a variety of real-world and artificially generated examples in single precision, we improve either accuracy or resource utilization by up to 60%.
Title: SOAP: Structural optimization of arithmetic expressions for high-level synthesis
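The Pareto-frontier selection that SOAP performs over its two objectives can be illustrated with a minimal sketch. The candidate rewrites and their error/LUT figures below are entirely invented; the real tool derives them from abstract semantics, not from a hand-written list.

```python
def pareto_frontier(candidates):
    """Pareto-optimal (name, error, area) triples; smaller is better in both.

    Sort by (error, area), then keep a candidate only if it improves on the
    smallest area seen so far -- every kept point is therefore undominated.
    """
    frontier = []
    for name, err, area in sorted(candidates, key=lambda c: (c[1], c[2])):
        if not frontier or area < frontier[-1][2]:
            frontier.append((name, err, area))
    return frontier
```

A synthesis flow would then pick any frontier point meeting its accuracy and resource constraints, exactly the flexibility the abstract describes.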
Algorithms such as Bag of Words and Simhash are widely used in image recognition. To achieve better performance as well as energy efficiency, this paper proposes a hardware implementation of these two algorithms. To the best of our knowledge, it is the first time these algorithms have been implemented in hardware for image recognition. The proposed implementation generates a fingerprint of an image and accurately finds the closest match in a database. It is implemented on Xilinx's Virtex-6 SX475T FPGA. Trade-offs between high performance and low hardware overhead are obtained through proper parallelization. Experimental results show that the proposed implementation can process 1,018 images per second, approximately 17.8x faster than software on Intel's 12-thread Xeon X5650 processor, while consuming 0.35x the power of the software-based implementation. The overall advantage in energy efficiency is thus as much as 46x. The proposed architecture is scalable and able to meet various requirements of image recognition.
Title: A hardware implementation of Bag of Words and Simhash for image recognition
Authors: Shengye Wang, Chen Liang, Xuegong Zhou, Wei Cao, Chen-Mie Wu, Xitian Fan, Lingli Wang
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718403
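A minimal software sketch of the Simhash fingerprint-and-match flow helps make the abstract concrete. This is the generic algorithm (weighted bit-voting plus Hamming-distance lookup) with MD5 as a stand-in feature hash; the paper's hardware pipeline and feature extraction are far more elaborate.

```python
import hashlib

def simhash(features, bits=64):
    """Weighted bit-vote Simhash fingerprint of a list of string features."""
    votes = [0] * bits
    for feat in features:
        h = int(hashlib.md5(feat.encode()).hexdigest(), 16)   # stand-in hash
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1             # vote per bit
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return bin(a ^ b).count("1")

def closest(query, database):
    """Key of the stored fingerprint nearest to the query in Hamming distance."""
    return min(database, key=lambda k: hamming(query, database[k]))
```

Similar feature sets vote most bits the same way, so near-duplicate images land a small Hamming distance apart, which is what makes the database lookup cheap to parallelize in hardware.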
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718358
K. Andryc, Murtaza Merchant, R. Tessier
Over the past decade, soft microprocessors and vector processors have been extensively used in FPGAs for a wide variety of applications. However, it is difficult to straightforwardly extend their functionality to support conditional and thread-based execution characteristic of general-purpose graphics processing units (GPGPUs) without recompiling FPGA hardware for each application. In this paper, we describe the implementation of FlexGrip, a soft GPGPU architecture which has been optimized for FPGA implementation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU without hardware recompilation. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. The benefits of our architecture are evaluated for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of up to 30× versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip.
Title: FlexGrip: A soft GPGPU for FPGAs
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718424
J. Cai, Ruolong Lian, Mengyao Wang, Andrew Canis, Jongsok Choi, B. Fort, Eric Hart, Emily Miao, Yanyan Zhang, Nazanin Calagar, S. Brown, J. Anderson
We apply high-level synthesis (HLS) to generate Blokus Duo game-playing hardware for the FPT 2013 Design Competition [3]. Our design, written in C, is synthesized to Verilog using the LegUp open-source HLS tool, then mapped using vendor tools to an Altera Cyclone IV FPGA on a DE2 board. Our software implementation is designed to be amenable to high-level synthesis: it includes a custom stack implementation, uses only integer arithmetic, and employs bitwise logical operations to improve overall computational performance. The underlying AI decision-making is based on alpha-beta pruning [2]. The performance of our synthesizable solution is gauged by playing against Pentobi [8], a "known good" C++ software implementation.
Title: From C to Blokus Duo with LegUp high-level synthesis
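The alpha-beta pruning at the core of the AI can be sketched generically. This is the textbook algorithm, not the team's C implementation; the `children`/`evaluate` callbacks and the toy game tree in the test are invented for illustration.

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Alpha-beta pruned minimax search over a game tree.

    children(node) -> list of successor positions; evaluate(node) -> score
    from the maximizing player's point of view.
    """
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        value = float("-inf")
        for child in kids:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break        # beta cutoff: the minimizer avoids this branch
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                         True, children, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break        # alpha cutoff: the maximizer avoids this branch
        return value
```

Because pruned subtrees are never evaluated, the same search depth fits in a much smaller time budget, which matters for both the software and the synthesized hardware.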
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718360
Charles Eric LaForest, J. Gregory Steffan
Common practice for large FPGA design projects is to divide sub-projects into separate synthesis partitions to allow incremental recompilation as each sub-project evolves. In contrast, smaller design projects avoid partitioning to give the CAD tool the freedom to perform as many global optimizations as possible, knowing that the optimizations normally improve performance and possibly area. In this paper, we show that for high-speed tiled designs composed of duplicated components, and hence having multi-localities (multiple instances of equivalent logic), a designer can use partitioning to preserve multi-locality and improve performance. In particular, we focus on the lanes of SIMD soft processors and on multicore meshes composed of them, as compiled by Quartus 12.1 targeting a Stratix IV EP4SE230F29C2 device. We demonstrate that, with negligible impact on compile time (less than ±10%): (i) we can use partitioning to provide high-level information to the CAD tool about preserving multi-localities in a design, without low-level micro-managing of the design description or CAD tool settings; (ii) by preserving multi-localities within SIMD soft processors, we can increase both frequency (by up to 31%) and compute density (by up to 15%); (iii) partitioning improves the density and speed (by up to 51% and 54%, respectively) of a mesh of soft processors, across many building-block configurations and mesh geometries; (iv) the improvements from partitioning grow as the number of tiled computing elements (SIMD lanes or mesh nodes) increases. As an example of the benefits of partitioning, a mesh of 102 scalar soft processors improves its operating frequency from 284 MHz to 437 MHz and its peak performance from 28,968 to 44,574 MIPS, while increasing its logic area by only 0.85%.
Title: Maximizing speed and density of tiled FPGA overlays via partitioning
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718343
Ana Klimovic, J. Anderson
We propose the high-level synthesis of an FPGA-based hybrid computing system in which implementations of compute-intensive functions are available both in software and as hardware accelerators. The accelerators are optimized to handle common-case inputs, as opposed to worst-case inputs, allowing accelerator area to be reduced by 28% on average while retaining the majority of the performance advantages of a hardware versus software implementation. When inputs exceed the range that the hardware accelerators can handle, a software fallback is automatically triggered. The accelerator area is optimized by reducing datapath widths based on application profiling of variable ranges in software (under typical datasets). The selected widths are passed to a high-level synthesis tool, which generates the accelerator for a given function. The optimized accelerators with software fallback capability are generated automatically by our framework, with minimal user intervention. Our study explores the trade-offs of delay and area for benchmarks implemented on an Altera Cyclone II FPGA.
Title: Bitwidth-optimized hardware accelerators with software fallback
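The profile-then-narrow-with-fallback idea can be illustrated in a few lines. The `profiled_bitwidth` and `make_accelerated` names are hypothetical, and the real framework narrows HLS datapaths rather than wrapping Python callables; this sketch only models the range check and dispatch.

```python
def profiled_bitwidth(samples):
    """Bits needed to hold every non-negative value seen while profiling."""
    return max(max(samples), 1).bit_length()

def make_accelerated(func, width):
    """Model a narrowed accelerator with software fallback.

    Inputs that fit the profiled width take the 'hw' path; out-of-range
    inputs automatically fall back to the full-width software path.
    """
    limit = (1 << width) - 1
    def wrapped(x):
        if 0 <= x <= limit:
            return func(x), "hw"     # common case: fits the reduced datapath
        return func(x), "sw"         # rare case: software fallback triggered
    return wrapped
```

Narrowing to the profiled width is what yields the area savings; the fallback preserves correctness on the rare inputs the profile never saw.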
Pub Date: 2013-12-01 | DOI: 10.1109/FPT.2013.6718332
D. Unnikrishnan, S. Virupaksha, Lekshmi Krishnan, Lixin Gao, R. Tessier
Iterative algorithms represent a pervasive class of data mining, web search, and scientific computing applications. In iterative algorithms, a final result is derived by performing repetitive computations on an input data set. Existing techniques to parallelize such algorithms typically use software frameworks such as MapReduce and Hadoop to distribute the data for an iteration across multiple CPU-based workstations in a cluster and collect per-iteration results. These platforms are marked by the need to synchronize data computations at iteration boundaries, impeding system performance. In this paper, we demonstrate that FPGAs in distributed computing systems can play a vital role in breaking this synchronization barrier with the help of asynchronous accumulative updates. These updates allow intermediate results for numerous data points to accumulate without iteration-based barriers, letting individual nodes in a cluster independently make progress towards the final outcome. Computation is dynamically prioritized to accelerate algorithm convergence. A general class of iterative algorithms has been implemented on a cluster of four FPGAs. A speedup of 7× is achieved over an implementation of asynchronous accumulative updates on a general-purpose CPU, and the system offers up to 154× speedup versus a standard Hadoop-based CPU workstation. Clusters of FPGAs improve performance further.
Title: Accelerating iterative algorithms with asynchronous accumulative updates on FPGAs
Pub Date : 2013-12-01  DOI: 10.1109/FPT.2013.6718363
D. Seemuth, Katherine Morrow
Embedded systems often contain many components, sometimes including multiple Field Programmable Gate Arrays (FPGAs). Designing Printed Circuit Boards (PCBs) for these systems can be a complex process that is often tedious, error-prone, and time-intensive. Existing computer-aided design tools require designers to manually insert components and explicitly define the connections between components on the PCB - a cumbersome process. A fast PCB design framework requiring less designer time and effort would be particularly advantageous for rapid prototyping and short-production-run PCBs. Therefore, this paper proposes a novel, freely available open-source framework that captures design intent and automatically implements the design details. Designers express connectivity at a higher level of abstraction than enumerating or drawing each individual trace between components. Given the components and connection requirements, the proposed framework automatically generates component placements, I/O voltage supply assignments, and FPGA pin assignments to minimize trace length. We also propose a novel method that improves trace length estimates during placement, before FPGA pins have actually been assigned to those connections. The proposed framework quickly explores large solution spaces, enabling rapid prototyping and design space exploration, and can reduce design time and other non-recurring costs. We demonstrate that it produces favorable results for various design requirements, suggesting the framework will be especially valuable to designers of systems with multiple FPGAs having large numbers of flexible pins.
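To make the pin-assignment step concrete, here is a minimal greedy sketch: given fixed (x, y) locations for an FPGA's flexible pins and for the fixed endpoints each net must reach, assign each net the nearest still-unused pin so traces stay short. The function name, data formats, and greedy strategy are illustrative assumptions; the paper's framework additionally handles component placement and I/O voltage supply constraints, which this sketch omits.

```python
from math import hypot

def assign_fpga_pins(flexible_pins, targets):
    """Greedy nearest-pin assignment (illustrative sketch only).

    flexible_pins: dict pin_name -> (x, y) package location (assumed format).
    targets: dict net_name -> (x, y) fixed endpoint on another component.
    Returns a net -> pin mapping chosen to keep each trace short.
    """
    available = dict(flexible_pins)
    assignment = {}
    for net, (tx, ty) in sorted(targets.items()):
        # pick the closest still-unused flexible pin for this net
        best = min(available,
                   key=lambda p: hypot(available[p][0] - tx,
                                       available[p][1] - ty))
        assignment[net] = best
        del available[best]          # each physical pin serves one net
    return assignment
```

A production tool would treat this jointly with placement (a greedy pass can be far from optimal), but the sketch shows why flexible FPGA pins make the trace-length objective tractable: any free pin in a compatible I/O bank can serve a net, so assignment reduces to a matching problem.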
{"title":"Automated multi-device placement, I/O voltage supply assignment, and pin assignment in circuit board design","authors":"D. Seemuth, Katherine Morrow","doi":"10.1109/FPT.2013.6718363","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718363","url":null,"abstract":"Embedded systems often contain many components, sometimes including multiple Field Programmable Gate Arrays (FPGAs). Designing Printed Circuit Boards (PCBs) for these systems can be a complex process that is often tedious, error-prone, and time-intensive. Existing computer-aided design tools require designers to manually insert components and explicitly define the connections between components on the PCB - a cumbersome process. A fast PCB design framework requiring less designer time and effort would be particularly advantageous for rapid prototyping and short-production-run PCBs. Therefore, this paper proposes a novel, freely available open-source framework that captures design intent and automatically implements the design details. Designers express connectivity at a higher level of abstraction than enumerating or drawing each individual trace between components. Given the components and connection requirements, the proposed framework automatically generates component placements, I/O voltage supply assignments, and FPGA pin assignments to minimize trace length. We also propose a novel method that improves trace length estimates during placement, before FPGA pins have actually been assigned to those connections. The proposed framework quickly explores large solution spaces, enabling rapid prototyping and design space exploration, and can reduce design time and other non-recurring costs. We demonstrate that it produces favorable results for various design requirements, suggesting the framework will be especially valuable to designers of systems with multiple FPGAs having large numbers of flexible pins.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126986581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}