
2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM): Latest Publications

A LUT-Based Approximate Adder
Andreas Becher, Jorge Echavarria, Daniel Ziener, S. Wildermann, J. Teich
In this paper, we propose a novel approximate adder structure for LUT-based FPGA technology. Compared with a full-featured, accurate carry-ripple adder, the longest path is significantly shortened, which enables clocking at an increased frequency. By using the proposed adder structure, the throughput of an FPGA-based implementation can be significantly increased. At the same time, the resulting average error can be reduced compared to similar approaches for ASIC implementations.
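The paper's LUT-level construction is not reproduced here, but the basic idea behind many approximate adders, cutting the carry chain into short segments so the critical path no longer spans the full word, can be shown with a small software model. The segment width and the zeroed segment carry-ins below are illustrative assumptions, not the authors' design.

```python
def approx_add(a: int, b: int, width: int = 16, seg: int = 4) -> int:
    """Approximate addition: the carry chain is cut every `seg` bits.

    Each segment is summed independently with a carry-in of 0, so the longest
    carry path is `seg` bits instead of `width` bits.  This is a generic
    software model of segmented approximate addition, not the LUT-level
    structure proposed in the paper.
    """
    mask = (1 << seg) - 1
    result = 0
    for lo in range(0, width, seg):
        sa = (a >> lo) & mask
        sb = (b >> lo) & mask
        result |= ((sa + sb) & mask) << lo   # carry-out of the segment is dropped
    return result & ((1 << width) - 1)

# Exact 13 + 7 = 20; the approximate result differs when a carry would have
# crossed a segment boundary.
print(approx_add(13, 7), 13 + 7)
```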
{"title":"A LUT-Based Approximate Adder","authors":"Andreas Becher, Jorge Echavarria, Daniel Ziener, S. Wildermann, J. Teich","doi":"10.1109/FCCM.2016.16","DOIUrl":"https://doi.org/10.1109/FCCM.2016.16","url":null,"abstract":"In this paper, we propose a novel approximate adder structure for LUT-based FPGA technology. Compared with a full featured accurate carry-ripple adder, the longest path is significantly shortened which enables the clocking with an increased clock frequency. By using the proposed adder structure, the throughput of an FPGA-based implementation can be significantly increased. On the other hand, the resulting average error can be reduced compared to similar approaches for ASIC implementations.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123039851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Evaluating Embedded FPGA Accelerators for Deep Learning Applications
Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Vamsi Buddha, Nachiket Kapre
FPGA-based embedded soft vector processors can exceed the performance and energy efficiency of embedded GPUs and DSPs for lightweight deep learning applications. For low-complexity deep neural networks targeting resource-constrained platforms, we develop optimized Caffe-compatible deep learning library routines that target a range of embedded accelerator-based systems with 4–8 W power budgets, such as the Xilinx Zedboard (with MXP soft vector processor), NVIDIA Jetson TK1 (GPU), InForce 6410 (DSP), TI EVM5432 (DSP), and the Adapteva Parallella board (custom multi-core with NoC). For MNIST (28×28 images) and CIFAR10 (32×32 images), the deep layer structure is amenable to MXP-enhanced FPGA mappings that deliver 1.4–5× higher energy efficiency than all other platforms. Not surprisingly, embedded GPUs work better for complex networks with large image resolutions.
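As a rough illustration of what such library routines compute, the sketch below is a plain fully-connected layer forward pass with a hypothetical fixed-point rounding step; it does not reflect the authors' Caffe-compatible code or the MXP vector instructions.

```python
import numpy as np

def fc_forward(x, weights, bias, frac_bits=8):
    """Fully-connected layer forward pass with simple fixed-point rounding.

    Only a reference model of the kind of routine an embedded deep learning
    library exposes; the frac_bits quantization is a hypothetical stand-in
    for the reduced-precision arithmetic used on such platforms.
    """
    y = x @ weights + bias                      # dense matrix-vector product
    scale = 1 << frac_bits
    y_fixed = np.round(y * scale) / scale       # emulate fixed-point rounding
    return np.maximum(y_fixed, 0.0)             # ReLU activation

# Example: one 28x28 MNIST-sized input through a 784 -> 10 layer.
x = np.random.rand(1, 784).astype(np.float32)
w = np.random.rand(784, 10).astype(np.float32) * 0.01
b = np.zeros(10, dtype=np.float32)
print(fc_forward(x, w, b).shape)   # (1, 10)
```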
{"title":"Evaluating Embedded FPGA Accelerators for Deep Learning Applications","authors":"Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Vamsi Buddha, Nachiket Kapre","doi":"10.1109/FCCM.2016.14","DOIUrl":"https://doi.org/10.1109/FCCM.2016.14","url":null,"abstract":"FPGA-based embedded soft vector processors can exceed the performance and energy-efficiency of embedded GPUs and DSPs for lightweight deep learning applications. For low complexity deep neural networks targeting resource constrained platforms, we develop optimized Caffe-compatible deep learning library routines that target a range of embedded accelerator-based systems between 4 -- 8 W power budgets such as the Xilinx Zedboard (with MXP soft vector processor), NVIDIA Jetson TK1 (GPU), InForce 6410 (DSP), TI EVM5432 (DSP) as well as the Adapteva Parallella board (custom multi-core with NoC). For MNIST (28×28 images) and CIFAR10 (32×32 images), the deep layer structure is amenable to MXP-enhanced FPGA mappings to deliver 1.4 -- 5× higher energy efficiency than all other platforms. Not surprisingly, embedded GPU works better for complex networks with large image resolutions.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125640498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Spatial Predicates Evaluation in the Geohash Domain Using Reconfigurable Hardware
Dajung Lee, R. Moussalli, S. Asaad, M. Srivatsa
As location sensing devices become ubiquitous, overwhelming amounts of data are being produced by the Internet-of-Things-That-Move. Though analyzing this data presents significant business opportunities, new techniques are needed to attain adequate levels of processing performance. One example is the recently introduced geohash geographical coordinate system, which is mainly used for indexing. While geohash codes provide useful inherent properties such as hierarchical and variable-precision coding, traditional spatial algorithms operate on data represented in the conventional latitude/longitude geographical coordinate system and therefore do not take advantage of geohash coding. This paper tackles the evaluation of spatial predicates on geometries defined in the geohash domain, as an alternative to the standard Dimensionally Extended Nine-Intersection Model (DE-9IM). We present the first hardware architecture to efficiently evaluate "contain" and "touch" (internal, external, corner) relations between streams of pairs of geohash codes in a high-throughput (no-stall) fashion. By employing FPGAs to exploit the bit-level granularity of geohash codes, we obtain (end-to-end) speedups of more than 20× and 90× over highly optimized single-threaded DE-9IM implementations of the contain and touch predicates, respectively. Furthermore, the PCIe-bound FPGA-based solution outperforms a geohash-based multithreaded CPU implementation by ≈1.8× (touch predicate) while using minimal FPGA resources.
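The "contain" relation maps naturally onto the hierarchical property of geohash codes: a cell contains another cell exactly when its code is a prefix of the other's. A minimal software reference for that check is sketched below; the hardware evaluates the same relation bit-serially on streamed code pairs, which the sketch does not model.

```python
def geohash_contains(outer: str, inner: str) -> bool:
    """True if the cell `outer` contains the cell `inner`.

    Geohash codes are hierarchical: every extra character refines the cell,
    so containment reduces to a prefix test.  This is a software reference
    for the relation the hardware evaluates on streams of code pairs.
    """
    return len(outer) <= len(inner) and inner.startswith(outer)

# A 5-character cell contains every 6-character refinement of it.
print(geohash_contains("dr5ru", "dr5ru7"))   # True
print(geohash_contains("dr5ru7", "dr5ru"))   # False
```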
{"title":"Spatial Predicates Evaluation in the Geohash Domain Using Reconfigurable Hardware","authors":"Dajung Lee, R. Moussalli, S. Asaad, M. Srivatsa","doi":"10.1109/FCCM.2016.51","DOIUrl":"https://doi.org/10.1109/FCCM.2016.51","url":null,"abstract":"As location sensing devices are becoming ubiquitous, overwhelming amounts of data are being produced by the Internet-of-Things-That-Move. Though analyzing this data presents significant business opportunities, new techniques are needed to attain adequate levels of processing performance. One example is the recently introduced geohash geographical coordinate system that is mainly used for indexing. While geohash codes provide useful inherent properties such as hierarchical and variable-precision coding, traditional spatial algorithms operate on data represented using the conventional latitude/longitude geographical coordinate system, and as such do not take advantage of geohash coding. This paper tackles the evaluation of spatial predicates on geometries defined in the geohash domain, as an alternative to the standard Dimensionally Extended Nine-Intersection Model (DE-9IM). We present the first hardware architecture to efficiently evaluate \"contain\" and \"touch\" (internal, external, corner) relations between streams of pairs of geohash codes, in a high throughput (no stall) fashion. Employing FPGAs for exploiting the bit-level granularity of geohash codes, experimental results show (end-to-end) speedup of more than 20× and 90× over highly optimized single-threaded DE-9IM implementations of the contain and touch predicates, respectively. Furthermore, the PCIe-bound FPGA-based solution outperforms a geohash-based multithreaded CPU implementation by ≈1.8× (touch predicate) while using minimal FPGA resources.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114099172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Finding Space-Time Stream Permutations for Minimum Memory and Latency
Thaddeus Koehn, P. Athanas
Processing of parallel data streams requires permutation units for many algorithms where the streams are not independent. Such algorithms include transforms, multi-rate signal processing, and Viterbi decoding. The absolute order of data elements from the permutation is not important, only that data elements are located correctly for the next processing step. This paper describes a method to find permutations that require a minimum amount of memory and latency. The required permutations are generated based on the data dependencies of a computation set. Additional constraints are imposed so that the parallel streaming architecture processes the data without flow control. Results show agreement with brute force methods, which become computationally infeasible for large permutation sets.
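One way to see the memory question: for a fixed arrival schedule and a candidate departure schedule, the buffer must hold every element that has arrived but not yet left, so the minimum memory for that schedule is the peak of this count. The small model below computes that peak for given schedules; it is a generic illustration, not the authors' search procedure.

```python
def peak_occupancy(arrival, departure):
    """Minimum buffer size needed to realize a stream permutation.

    arrival[i] and departure[i] are the cycles at which element i enters and
    leaves the permutation unit.  The required memory is the maximum number of
    elements resident at any cycle.  This only costs one candidate schedule;
    searching over legal, flow-control-free schedules (as the paper does) is
    not shown here.
    """
    events = [(t, +1) for t in arrival] + [(t, -1) for t in departure]
    occupancy, peak = 0, 0
    for _, delta in sorted(events, key=lambda e: (e[0], e[1])):
        occupancy += delta
        peak = max(peak, occupancy)
    return peak

# Reversing a 4-element stream that arrives one element per cycle:
arrival   = [0, 1, 2, 3]
departure = [7, 6, 5, 4]   # the first element to arrive leaves last
print(peak_occupancy(arrival, departure))   # 4: all elements must be buffered
```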
{"title":"Finding Space-Time Stream Permutations for Minimum Memory and Latency","authors":"Thaddeus Koehn, P. Athanas","doi":"10.1109/FCCM.2016.54","DOIUrl":"https://doi.org/10.1109/FCCM.2016.54","url":null,"abstract":"Processing of parallel data streams requires permutation units for many algorithms where the streams are not independent. Such algorithms include transforms, multi-rate signal processing, and Viterbi decoding. The absolute order of data elements from the permutation is not important, only that data elements are located correctly for the next processing step. This paper describes a method to find permutations that require a minimum amount of memory and latency. The required permutations are generated based on the data dependencies of a computation set. Additional constraints are imposed so that the parallel streaming architecture processes the data without flow control. Results show agreement with brute force methods, which become computationally infeasible for large permutation sets.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115796634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Continuous Online Self-Monitoring Introspection Circuitry for Timing Repair by Incremental Partial-Reconfiguration (COSMIC TRIP)
Hans Giesen, Benjamin Gojman, Raphael Rubin, Ji Kim, A. DeHon
We show that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows an FPGA to detect and self-adapt to aging and environmental effects on timing. Using a lightweight (<4% added area) mechanism for monitoring transition timing, a Difference Detector with First-Fail Latch, we can estimate the timing margin on circuits and identify the individual links that have degraded and whose delay is determining the worst-case circuit delay. Combined with Choose-Your-own-Adventure precomputed, fine-grained repair alternatives, we introduce a strategy for rapid, in-system incremental repair of links with degraded timing. We show that these techniques allow us to respond to a single aging event in less than 300 ms for the toronto20 benchmarks. The result is a step toward systems where adaptive reconfiguration on the time-scale of seconds is viable and beneficial.
{"title":"Continuous Online Self-Monitoring Introspection Circuitry for Timing Repair by Incremental Partial-Reconfiguration (COSMIC TRIP)","authors":"Hans Giesen, Benjamin Gojman, Raphael Rubin, Ji Kim, A. DeHon","doi":"10.1145/3158229","DOIUrl":"https://doi.org/10.1145/3158229","url":null,"abstract":"We show that continuously monitoring on-chip delays at the LUT-to-LUT link level during operation allows an FPGA to detect and self-adapt to aging and environmental effects on timing. Using a lightweight (<;4% added area) mechanism for monitoring transition timing, a Difference Detector with First-Fail Latch, we can estimate the timing margin on circuits and identify the individual links that have degraded and whose delay is determining the worst-case circuit delay. Combined with Choose-Your-own-Adventure precomputed, fine-grained repair alternatives, we introduce a strategy for rapid, in-system incremental repair of links with degraded timing. We show that these techniques allow us to respond to a single aging event in less than 300 ms for the toronto20 benchmarks. The result is a step toward systems where adaptive reconfiguration on the time-scale of seconds is viable and beneficial.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116072188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication
Peipei Zhou, Hyunseok Park, Zhenman Fang, J. Cong, A. DeHon
Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that designs with II>1 can sometimes achieve lower dynamic energy than II=1 designs due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.
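The trade-off can be framed with the standard pipelining relations; a sketch under the usual HLS assumptions (N loop iterations, pipeline depth D, clock frequency f, average dynamic power P_dyn):

```latex
\text{cycles} \approx D + II\,(N-1), \qquad
\text{throughput} \approx \frac{f}{II}, \qquad
E_{\text{dyn}} = P_{\text{dyn}} \cdot \frac{\text{cycles}}{f}.
```

Increasing II stretches the runtime roughly linearly, so dynamic energy only drops when the accompanying resource and interconnect savings shrink P_dyn faster than the runtime grows, which is the effect observed for some matrix-multiply configurations.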
{"title":"Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication","authors":"Peipei Zhou, Hyunseok Park, Zhenman Fang, J. Cong, A. DeHon","doi":"10.1109/FCCM.2016.50","DOIUrl":"https://doi.org/10.1109/FCCM.2016.50","url":null,"abstract":"Customized pipeline designs that minimize the pipeline initiation interval (II) maximize the throughput of FPGA accelerators designed with high-level synthesis (HLS). What is the impact of minimizing II on energy efficiency? Using a matrix-multiply accelerator, we show that matrix multiplies with II>1 can sometimes reduce dynamic energy below II=1 due to interconnect savings, but II=1 always achieves energy close to the minimum. We also identify sources of inefficient mapping in the commercial tool flow.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123185976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Cost Effective Partial Scan for Hardware Emulation
Tao Li, Qiang Liu
An FPGA-based hardware emulation platform runs significantly faster than software simulation for verifying complex circuit designs. However, the controllability and observability of circuit-internal signals mapped onto FPGAs are restricted by the limited number of chip pins. Scan chain-based techniques are effective in providing full-chip controllability and observability, at the cost of a large area overhead, especially on FPGAs. Partial scan has therefore been proposed as an alternative that improves controllability and observability while reducing the area cost. However, existing approaches do not always find the optimized partial scan solution with the minimum number of scan flip-flops. This paper formulates the classical balanced-structure partial scan procedure in a single step as an integer linear programming (ILP) problem, leading to the optimized partial scan solution. In addition, partially used logic resources in the FPGA are exploited to implement the extra logic required by the scan chain, further reducing the area cost. Experimental results show that our partial scan approach reduces the area overhead by 78.6% and 16.6% compared to full scan and an existing partial scan approach, respectively.
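The paper's one-step balanced-structure formulation is not reproduced here; the sketch below shows the classic cycle-breaking flavour of the problem, selecting a minimum set of flip-flops to scan so that every feedback loop contains at least one scanned flip-flop. networkx and PuLP are assumed as tooling for illustration only, and enumerating all simple cycles scales only to small examples.

```python
import networkx as nx
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, PULP_CBC_CMD

def min_scan_set(edges):
    """Pick a minimum set of flip-flops to scan so every cycle is broken.

    `edges` are directed dependencies between flip-flops.  This is the classic
    cycle-breaking ILP for partial scan, shown only to illustrate the kind of
    integer program involved; the paper's one-step balanced-structure
    formulation adds further constraints not modeled here.
    """
    g = nx.DiGraph(edges)
    prob = LpProblem("partial_scan", LpMinimize)
    scan = {v: LpVariable(f"scan_{v}", cat="Binary") for v in g.nodes}
    prob += lpSum(scan.values())                      # minimize scanned FFs
    for cycle in nx.simple_cycles(g):                 # every loop must be cut
        prob += lpSum(scan[v] for v in cycle) >= 1
    prob.solve(PULP_CBC_CMD(msg=False))
    return [v for v, var in scan.items() if var.value() > 0.5]

# Two loops sharing flip-flop 'b': scanning 'b' alone breaks both.
print(min_scan_set([("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]))
```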
{"title":"Cost Effective Partial Scan for Hardware Emulation","authors":"Tao Li, Qiang Liu","doi":"10.1109/FCCM.2016.39","DOIUrl":"https://doi.org/10.1109/FCCM.2016.39","url":null,"abstract":"FPGA-based hardware emulation platform runs significantly faster than software simulation for verifying complex circuit designs. However, the controllability and observability of circuit internal signals mapped onto FPGAs are restricted due to the limited chip pins. Scan chain-based technique is effective in providing full-chip controllability and observability, at the cost of large area overhead, especially for FPGAs. Therefore, partial scan has been proposed as an alternative way to improve the controllability and observability while reducing the area cost. However, the optimized partial scan solution with the minimum number of scan flip-flops is not always found. This paper formulates the classical balanced structure partial scan procedure in one step as an integer linear programming problem, leading to the optimized partial scan solution. In addition, partially used logic resources in FPGAs are exploited to implement the extra logic required by the scan chain, to further reduce the area cost. Experimental results show that our partial scan approach can reduce the area overhead by 78.6% and 16.6% compared to the full scan and the existing partial scan approach.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"7 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120853874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms
Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars
We propose a novel FPGA-accelerated implementation of BWA-MEM, a popular tool for genomic read mapping. The performance and power efficiency of the FPGA implementation on a single Xilinx Virtex-7 Alpha Data add-in card are compared against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved, along with a 1.6× gain in power efficiency. To facilitate platform- and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.
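The metric itself is simple to compute from throughput and board power; the snippet below only defines it, and the values in the example are placeholders rather than measurements from the paper.

```python
def base_pairs_per_joule(base_pairs_per_second: float, watts: float) -> float:
    """Power-efficiency metric for platform-agnostic comparison:
    work done (mapped base pairs) divided by energy consumed (Joules)."""
    return base_pairs_per_second / watts

# Placeholder values for illustration only, not measurements from the paper.
print(base_pairs_per_joule(base_pairs_per_second=1.0e6, watts=30.0))
```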
{"title":"Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms","authors":"Ernst Houtgast, V. Sima, G. Marchiori, K. Bertels, Z. Al-Ars","doi":"10.1109/FCCM.2016.17","DOIUrl":"https://doi.org/10.1109/FCCM.2016.17","url":null,"abstract":"We propose a novel FPGA-accelerated BWA-MEM implementation, a popular tool for genomic data mapping. The performance and power-efficiency of the FPGA implementation on the single Xilinx Virtex-7 Alpha Data add-in card is compared against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125688409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Vertex-Centric Graph Processing on FPGA
Nina Engelhardt, Hayden Kwok-Hay So
Past research and implementation efforts have shown that FPGAs are efficient at processing many graph algorithms. However, they are notoriously hard to program, leading to impractically long development times even for simple applications. We propose a vertex-centric framework for graph processing on FPGAs, providing a base execution model and distributed architecture so that developers need only write very small application kernels.
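In the vertex-centric ("think like a vertex") model, the developer writes only a per-vertex kernel that consumes messages from neighbours and emits new ones, while the framework handles traversal, scheduling, and communication. The sketch below is a generic single-source shortest-path kernel in that style; it is not the API of the proposed FPGA framework, only the programming model it targets.

```python
def sssp_kernel(vertex_value, incoming_messages, out_edges):
    """A generic vertex-centric kernel (Pregel style) for single-source
    shortest paths.  The framework calls this per vertex per superstep; the
    developer never writes the traversal, scheduling, or communication.
    Illustrates the programming model, not the paper's actual API."""
    candidate = min(incoming_messages, default=vertex_value)
    if candidate < vertex_value:
        # Distance improved: propagate updated distances to neighbours.
        messages = [(dst, candidate + weight) for dst, weight in out_edges]
        return candidate, messages
    return vertex_value, []          # nothing changed: vote to halt
```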
{"title":"Vertex-Centric Graph Processing on FPGA","authors":"Nina Engelhardt, Hayden Kwok-Hay So","doi":"10.1109/FCCM.2016.31","DOIUrl":"https://doi.org/10.1109/FCCM.2016.31","url":null,"abstract":"Past research and implementation efforts have shown that FPGAs are efficient at processing many graph algorithms. However, they are notoriously hard to program, leading to impractically long development times even for simple applications. We propose a vertex-centric framework for graph processing on FPGAs, providing a base execution model and distributed architecture so that developers need only write very small application kernels.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127765494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
DeCO: A DSP Block Based FPGA Accelerator Overlay with Low Overhead Interconnect
A. Jain, Xiangwei Li, P. Singhai, D. Maskell, Suhaib A. Fahmy
Coarse-grained FPGA overlay architectures paired with general-purpose processors offer a number of advantages for general-purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity. However, the area overheads of these overlays, and in particular of architectures with island-style interconnect, negate many of these advantages, preventing their use in practical FPGA-based systems. Crucially, the interconnect flexibility provided by these overlay architectures is normally over-provisioned for accelerators based on feed-forward pipelined datapaths, which in many cases have the general shape of inverted cones. We propose DeCO, a cone-shaped cluster of functional units (FUs) utilizing a simple linear interconnect between them. This reduces the area overhead of implementing compute kernels extracted from compute-intensive applications represented as directed acyclic dataflow graphs, while still allowing high data throughput. We perform design space exploration by modeling programmability overhead as a function of overlay design parameters, and compare to the programmability overhead of island-style overlays. We observe 87% savings in LUT requirements using the proposed approach compared to DSP block-based island-style overlays. Our experimental evaluation shows that the proposed overlay achieves a frequency of 395 MHz, close to the DSP theoretical limit on the Xilinx Zynq. We also present an automated tool flow that provides a rapid and vendor-independent mapping of high-level compute kernel code onto the proposed overlay.
{"title":"DeCO: A DSP Block Based FPGA Accelerator Overlay with Low Overhead Interconnect","authors":"A. Jain, Xiangwei Li, P. Singhai, D. Maskell, Suhaib A. Fahmy","doi":"10.1109/FCCM.2016.10","DOIUrl":"https://doi.org/10.1109/FCCM.2016.10","url":null,"abstract":"Coarse-grained FPGA overlay architectures paired with general purpose processors offer a number of advantages for general purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity. However, the area overheads of these overlays, and in particular architectures with island-style interconnect, negate many of these advantages, preventing their use in practical FPGA-based systems. Crucially, the interconnect flexibility provided by these overlay architectures is normally over-provisioned for accelerators based on feed-forward pipelined datapaths, which in many cases have the general shape of inverted cones. We propose DeCO, a cone shaped cluster of FUs utilizing a simple linear interconnect between them. This reduces the area overheads for implementing compute kernels extracted from compute-intensive applications represented as directed acyclic dataflow graphs, while still allowing high data throughput. We perform design space exploration by modeling programmability overhead as a function of overlay design parameters, and compare to the programmability overhead of island-style overlays. We observe 87% savings in LUT requirements using the proposed approach compared to DSP block based island-style overlays. Our experimental evaluation shows that the proposed overlay exhibits an achievable frequency of 395 MHz, close to the DSP theoretical limit on the Xilinx Zynq. We also present an automated tool flow that provides a rapid and vendor-independent mapping of the high level compute kernel code to the proposed overlay.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116889538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33