
Latest Publications: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA
Zhe Lin, Sharad Sinha, Wei Zhang
Decision trees are machine learning models commonly used in various application scenarios. In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets due to their stringent data storage requirements. Online decision tree learning algorithms have been devised to tackle this problem by concurrently training with incoming samples and providing inference results. However, even the most up-to-date online tree learning algorithms still suffer from either high memory usage or high computational intensity with dependencies and long latency, making them challenging to implement in hardware. To overcome these difficulties, we introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models. The proposed algorithm is lightweight in terms of both memory and computational demand, while still maintaining high generalization ability. A series of optimization techniques dedicated to the proposed algorithm have been investigated from the hardware perspective, including coarse-grained and fine-grained parallelism, dynamic and memory-based resource sharing, and pipelining with data forwarding. We further present a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques. Experimental results show that our proposed algorithm outperforms the state-of-the-art Hoeffding tree learning method, leading to a 0.05% to 12.3% improvement in inference accuracy. A real implementation of the complete learning system on an FPGA demonstrates a 384x to 1581x speedup in execution time over the state-of-the-art design.
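For context, here is a minimal sketch of the standard Hoeffding split test that underlies Hoeffding trees: a split is committed only once the observed gain gap between the two best attributes exceeds the Hoeffding bound. This illustrates the classic test only, not the paper's quantile-based variant or its hardware mapping; the function names and the default confidence parameter are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # With probability 1 - delta, the true mean of a random variable with
    # range R differs from the mean of n samples by less than this epsilon.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7) -> bool:
    # Commit to a split once the best attribute's lead over the runner-up
    # exceeds the bound, i.e. more samples are unlikely to change the winner.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```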
{"title":"Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA","authors":"Zhe Lin, Sharad Sinha, Wei Zhang","doi":"10.1109/FCCM.2019.00032","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00032","url":null,"abstract":"Decision trees are machine learning models commonly used in various application scenarios. In the era of big data, traditional decision tree induction algorithms are not suitable for learning large-scale datasets due to their stringent data storage requirement. Online decision tree learning algorithms have been devised to tackle this problem by concurrently training with incoming samples and providing inference results. However, even the most up-to-date online tree learning algorithms still suffer from either high memory usage or high computational intensity with dependency and long latency, making them challenging to implement in hardware. To overcome these difficulties, we introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models. The proposed algorithm is light-weight in terms of both memory and computational demand, while still maintaining high generalization ability. A series of optimization techniques dedicated to the proposed algorithm have been investigated from the hardware perspective, including coarse-grained and fine-grained parallelism, dynamic and memory-based resource sharing, pipelining with data forwarding. We further present a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array (FPGA) with system-level optimization techniques. Experimental results show that our proposed algorithm outperforms the state-of-the-art Hoeffding tree learning method, leading to 0.05% to 12.3% improvement in inference accuracy. Real implementation of the complete learning system on the FPGA demonstrates a 384x to 1581x speedup in execution time over the state-of-the-art design.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130283946","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
KPynq: A Work-Efficient Triangle-Inequality Based K-Means on FPGA
Yuke Wang, Zhaorui Zeng, Boyuan Feng, Lei Deng, Yufei Ding
K-means is a popular but computation-intensive algorithm for unsupervised learning. To address this issue, we present KPynq, a work-efficient triangle-inequality based K-means on FPGA for handling large, high-dimensional datasets. KPynq leverages an algorithm-level optimization to balance performance and computation irregularity, and a hardware architecture design to fully exploit the pipeline and parallel processing capability of various FPGAs. In experiments, KPynq consistently outperforms CPU-based standard K-means in terms of speedup (up to 4.2x) and energy efficiency (up to 218x).
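For reference, the sketch below shows the kind of triangle-inequality pruning (in the style of Elkan's algorithm) that lets an assignment pass skip most point-to-center distance computations; it is an algorithmic illustration under assumed data layouts, not KPynq's hardware datapath, and the function name is an assumption.

```python
import numpy as np

def assign_labels_triangle(X, centers):
    # Pairwise distances between centers, computed once per assignment pass.
    cc = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - centers[0])
        for c in range(1, len(centers)):
            # Triangle inequality: d(x, c) >= d(best, c) - d(x, best),
            # so if d(best, c) >= 2 * d(x, best), center c cannot win.
            if cc[best, c] >= 2.0 * best_d:
                skipped += 1
                continue
            d = np.linalg.norm(x - centers[c])
            if d < best_d:
                best, best_d = c, d
        labels[i] = best
    return labels, skipped
```

The pruning is exact: skipped centers provably cannot be closer than the current best, so the labels match a brute-force assignment while much of the distance work is avoided.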
{"title":"KPynq: A Work-Efficient Triangle-Inequality Based K-Means on FPGA","authors":"Yuke Wang, Zhaorui Zeng, Boyuan Feng, Lei Deng, Yufei Ding","doi":"10.1109/FCCM.2019.00061","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00061","url":null,"abstract":"K-means is a popular but computation-intensive algorithm for unsupervised learning. To address this issue, we present KPynq, a work-efficient triangle-inequality based K-means on FPGA for handling large-size, high-dimension datasets. KPynq leverages an algorithm-level optimization to balance the performance and computation irregularity, and a hardware architecture design to fully exploit the pipeline and parallel processing capability of various FPGAs. In the experiment, KPynq consistently outperforms the CPU-based standard K-means in terms of its speedup (up to 4.2x) and significant energy efficiency (up to 218x).","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"346 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121468274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
SimBNN: A Similarity-Aware Binarized Neural Network Acceleration Framework
Cheng Fu, Shilin Zhu, Huili Chen, F. Koushanfar, Hao Su, Jishen Zhao
Binarized Neural Networks (BNNs) eliminate bitwidth redundancy in Convolutional Neural Networks (CNNs) by using a single bit (-1/+1) for network parameters and intermediate representations. This greatly reduces off-chip data transfer and storage overhead. However, considerable computation redundancy remains in BNN inference. To tackle this problem, we investigate the similarity property in input data and kernel weights. Measured by our proposed metric, we identify an average of 79% input similarity and 61% kernel similarity across common network architectures. Motivated by this observation, we propose SimBNN, a fast and energy-efficient acceleration framework for BNN inference that leverages similarity properties. SimBNN consists of a set of similarity-aware accelerators, a weight reuse optimization algorithm, and a similarity selection mechanism. SimBNN incorporates two types of BNN accelerators, which exploit input similarity and kernel similarity, respectively. More specifically, the result from the previous stage is reused if similarity is identified, thus significantly reducing BNN computation overhead. Furthermore, we propose a weight reuse optimization algorithm, which increases weight similarity by re-ordering weight kernels off-line. Finally, our framework provides a systematic method to determine the optimal strategy between input data reuse and kernel weight reuse, based on the similarity characteristics of the input data and the pre-trained BNN.
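To illustrate why input similarity saves work in a BNN, here is a minimal bit-level sketch: an XOR/popcount dot product over +1/-1 values packed as bits (1 encodes +1, 0 encodes -1), plus an incremental update that corrects a cached result only at the positions where the new input differs from the previous one. This is a toy model of the reuse idea under those assumed encodings, not SimBNN's accelerator design; all names are hypothetical.

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def bnn_dot(in_bits: int, w_bits: int, n: int) -> int:
    # +1/-1 dot product via XOR/popcount: (matches) - (mismatches).
    mismatches = popcount((in_bits ^ w_bits) & ((1 << n) - 1))
    return n - 2 * mismatches

def reuse_dot(prev_in: int, prev_out: int, cur_in: int, w_bits: int, n: int) -> int:
    # Correct the cached result only where the inputs differ: flipping input
    # bit k changes its contribution from w*x to -w*x, a delta of -2*w*x.
    out = prev_out
    diff = prev_in ^ cur_in
    for k in range(n):
        if (diff >> k) & 1:
            w = 1 if (w_bits >> k) & 1 else -1
            x = 1 if (prev_in >> k) & 1 else -1
            out -= 2 * w * x
    return out
```

For any inputs, reuse_dot(prev, bnn_dot(prev, w, n), cur, w, n) equals bnn_dot(cur, w, n); the saving comes when diff has few set bits, which is exactly the high-similarity case the paper measures.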
{"title":"SimBNN: A Similarity-Aware Binarized Neural Network Acceleration Framework","authors":"Cheng Fu, Shilin Zhu, Huili Chen, F. Koushanfar, Hao Su, Jishen Zhao","doi":"10.1109/FCCM.2019.00060","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00060","url":null,"abstract":"Binarized Neural Networks (BNNs) eliminate bitwidth redundancy in Convolutional Neural Networks (CNNs) by using a single bit (-1/+1) for network parameters and intermediate representations. This greatly reduces off-chip data transfer and storage overhead. However, considerable computation redundancy remains in BNN inference. To tackle this problem, we investigate the similarity property in input data and kernel weights. We identify an average of 79% input similarity and 61% kernel similarity measured by our proposed metric across common network architectures. Motivated by this observation, we propose SimBNN, a fast and energy-efficient acceleration framework for BNN inference that leverages similarity properties. SimBNN consists of a set of similarity-aware accelerators, a weight reuse optimization algorithm, and a similarity selection mechanism. SimBNN incorporates two types of BNN accelerators, which exploit the input similarity and kernel similarity, respectively. More specifically, the result from the previous stage is reused if similarity is identified, thus significantly reducing BNN computation overhead. Furthermore, we propose a weight reuse optimization algorithm, which increases the weight similarity by off-line re-ordering weight kernels. Finally, our framework provides a systematic method to determine the optimal strategy between input data and kernel weights reuse, based on the similarity characteristics of input data and pre-trained BNNs.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122991661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Welcome Message from the General and Program Chairs
{"title":"Welcome Message from the General and Program Chairs","authors":"","doi":"10.1109/fccm.2019.00005","DOIUrl":"https://doi.org/10.1109/fccm.2019.00005","url":null,"abstract":"","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124013570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Safe Task Interruption for FPGAs
Sameh Attia, Vaughn Betz
Saving and restoring the state of an FPGA task in an orderly manner is essential for enabling hardware checkpointing and context switching. However, it requires task interruption, and stopping a task at an arbitrary time can cause several hazards, including deadlock and data loss. In this work, we build a context-switching simulator to reproduce and identify these hazards. In addition, we introduce design rules that should be followed to achieve safe task interruption, and propose task wrappers that can be placed around an FPGA task to implement these rules.
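The paper's rules and wrappers target hardware, but the core drain-before-checkpoint discipline can be shown with a small software model. The sketch below is a hypothetical illustration, assuming a wrapper that refuses new transactions after an interrupt request and only declares the task safe to capture once outstanding transactions reach zero; none of the names come from the paper.

```python
from enum import Enum

class State(Enum):
    RUN = 1    # task executing normally
    DRAIN = 2  # interrupt requested; waiting for in-flight work to finish
    SAFE = 3   # no outstanding transactions; state capture is hazard-free

class InterruptWrapper:
    def __init__(self):
        self.state = State.RUN
        self.outstanding = 0

    def issue(self) -> bool:
        # New transactions are accepted only while running; refusing them
        # during DRAIN avoids data loss from half-finished transfers.
        if self.state is State.RUN:
            self.outstanding += 1
            return True
        return False

    def complete(self) -> None:
        self.outstanding -= 1
        if self.state is State.DRAIN and self.outstanding == 0:
            self.state = State.SAFE

    def request_interrupt(self) -> None:
        self.state = State.SAFE if self.outstanding == 0 else State.DRAIN
```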
{"title":"Safe Task Interruption for FPGAs","authors":"Sameh Attia, Vaughn Betz","doi":"10.1109/FCCM.2019.00070","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00070","url":null,"abstract":"Saving and restoring the state of an FPGA task in an orderly manner is essential for enabling hardware checkpointing and context switching. However, it requires task interruption, and stopping a task at an arbitrary time can cause several hazards including deadlock and data loss. In this work, we build a context switching simulator to simulate and identify these hazards. In addition, we introduce design rules that should be followed to achieve safe task interruption, and propose task wrappers that can be placed around an FPGA task to implement these rules.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115498163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Impact of FPGA Architecture on Area and Performance of CGRA Overlays
Ian Taras, J. Anderson
Coarse-grained reconfigurable arrays (CGRAs) are programmable logic devices with ALU-style processing elements and datapath interconnect. CGRAs can be realized as custom ASICs or implemented on FPGAs as overlays. A key element of CGRAs is that they are typically software programmable with rapid compile times – an advantage arising from their coarse-grained characteristics, simplifying CAD mapping tasks. We implement two previously published CGRAs as overlays on two commercial FPGAs (Intel and Xilinx), and consider the impact of the underlying FPGA architecture on the CGRA area and performance. We present optimizations for the overlays to take advantage of the FPGA architectural features and show a peak performance improvement of 1.93x, as well as maximum area savings of 31.1% and 48.5% for Intel and Xilinx, respectively, relative to a naive first-cut implementation. We also present a novel technique for a configurable multiplexer implementation, which embeds the select signals into the SRAM configuration, saving 35.7% in area. The research is conducted using the open-source CGRA-ME (modeling and exploration) framework [1].
{"title":"Impact of FPGA Architecture on Area and Performance of CGRA Overlays","authors":"Ian Taras, J. Anderson","doi":"10.1109/FCCM.2019.00022","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00022","url":null,"abstract":"Coarse-grained reconfigurable arrays (CGRAs) are programmable logic devices with ALU-style processing elements and datapath interconnect. CGRAs can be realized as custom ASICs or implemented on FPGAs as overlays . A key element of CGRAs is that they are typically software programmable with rapid compile times – an advantage arising from their coarse-grained characteristics, simplifying CAD mapping tasks. We implement two previously published CGRAs as overlays on two commercial FPGAs (Intel and Xilinx), and consider the impact of the underlying FPGA architecture on the CGRA area and performance. We present optimizations for the overlays to take advantage of the FPGA architectural features and show a peak performance improvement of 1.93x, as well as maximum area savings of 31.1% and 48.5% for Intel and Xilinx, respectively, relative to a naive first-cut implementation. We also present a novel technique for a configurable multiplexer implementation, which embeds the select signals into SRAM configuration, saving 35.7% in area. The research is conducted using the open-source CGRA-ME (modeling and exploration) framework [1].","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125050455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Raparo: Resource-Level Angle-Based Parallel Routing for FPGAs
Minghua Shen, Nong Xiao
Routing is a time-consuming step in the FPGA compilation flow. Parallelizing routing has the potential to reduce this time, but the inherent ordering of nets introduces dependences between them. In this paper, we present Raparo, a resource-level angle-based parallel router. Raparo exploits angle-based region partitioning to drive the assignment of nets for efficient parallel routing on multi-core processor systems. Raparo parallelizes routing at the resource level rather than the region level, achieving convergence similar to a serial router. Results show that Raparo can scale to 32 processor cores to provide about a 16x speedup on average over the serial router, with acceptable impact on the quality of results.
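As a rough illustration of angle-based partitioning, the sketch below maps each net to an angular sector around the chip center so that separate workers can route disjoint sectors in parallel. The partitioning rule, function name, and parameters are assumptions made for illustration; the paper's actual scheme additionally parallelizes at the routing-resource level.

```python
import math

def angle_region(net_center, chip_center, num_regions):
    # Map a net to one of num_regions angular sectors around the chip center.
    # Nets in different sectors are less likely to contend for the same
    # routing resources, so each sector can be handled by a separate worker.
    dx = net_center[0] - chip_center[0]
    dy = net_center[1] - chip_center[1]
    theta = math.atan2(dy, dx) % (2 * math.pi)
    return int(theta / (2 * math.pi / num_regions))

# e.g. with 8 workers on a 100x100 fabric:
# worker_id = angle_region((bbox_x, bbox_y), (50.0, 50.0), 8)
```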
{"title":"Raparo: Resource-Level Angle-Based Parallel Routing for FPGAs","authors":"Minghua Shen, Nong Xiao","doi":"10.1109/FCCM.2019.00053","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00053","url":null,"abstract":"Routing is a time-consuming step in the FPGA compilation flow. The parallelization of routing has the potential to reduce the time but imposes the dependent problem as the inherent order of nets. In this paper, we present Raparo, a resource-level angle-based parallel router. Raparo exploits angle-based region partitioning to drive the assignment of the nets for efficient parallel routing on the multi-core processor systems. Raparo parallelizes the routing at resource level rather than region level for the similar convergence as the serial router. Results show that Raparo can scale to 32 processor cores to provide about 16x speedup on average with acceptable impacts on the quality of results, comparing to the serial router.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129921478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
An FPGA-Based Computing Infrastructure Tailored to Efficiently Scaffold Genome Sequences
Alberto Zeni, M. Crespi, Lorenzo Di Tucci, M. Santambrogio
In recent years, broad access to genomic data has improved the understanding and prevention of human diseases as never before. De-novo genome assembly represents a major obstacle to performing such analysis at a large scale, as it is one of the most time-consuming phases of genome analysis. In this paper, we present a scalable, high-performance and energy-efficient architecture for the alignment step of SSPACE, a state-of-the-art tool used to perform scaffolding, including in the case of de-novo assembly. The final architecture achieves up to a 9.83x speedup over the software version of Bowtie, the state-of-the-art tool used by SSPACE to perform the alignment.
{"title":"An FPGA-Based Computing Infrastructure Tailored to Efficiently Scaffold Genome Sequences","authors":"Alberto Zeni, M. Crespi, Lorenzo Di Tucci, M. Santambrogio","doi":"10.1109/FCCM.2019.00074","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00074","url":null,"abstract":"In the current years broad access to genomic data is leading to improve the understanding and prevention of human diseases as never before. De-novo genome assembly, represents a main obstacle to perform the analysis on a large scale, as it is one of the most time-consuming phases of the genome analysis. In this paper, we present a scalable, high performance and energy efficient architecture for the alignment step of SSPACE, a state of the art tool used to perform scaffolding also in case of de-novo assembly. The final architecture is able to achieve up to 9.83x speedup in performance when compared to the software version of Bowtie, a state of the art tool used by SSPACE to perform the alignment.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127082565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Maverick: A Stand-Alone CAD Flow for Partially Reconfigurable FPGA Modules
D. Glick, Jesse Grigg, B. Nelson, M. Wirthlin
This paper presents Maverick, a proof-of-concept computer-aided design (CAD) flow for generating reconfigurable modules (RMs) which target partial reconfiguration (PR) regions in field-programmable gate array (FPGA) designs. After an initial static design and PR region are created with Xilinx's Vivado PR flow, the Maverick flow can then compile and configure RMs onto that PR region without the use of vendor tools. Maverick builds upon existing open-source tools (Yosys, RapidSmith2, and Project X-Ray) to form an end-to-end compilation flow. This paper describes the Maverick flow and shows results from running it on a PYNQ-Z1's ARM processor to compile a set of HDL designs to partial bitstreams. The resulting bitstreams were configured onto the PYNQ-Z1's FPGA fabric, demonstrating the feasibility of a single-chip embedded system that can both compile HDL designs to bitstreams and configure them onto its own programmable fabric.
{"title":"Maverick: A Stand-Alone CAD Flow for Partially Reconfigurable FPGA Modules","authors":"D. Glick, Jesse Grigg, B. Nelson, M. Wirthlin","doi":"10.1109/FCCM.2019.00012","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00012","url":null,"abstract":"This paper presents Maverick, a proof-of-concept computer-aided design (CAD) flow for generating reconfigurable modules (RMs) which target partial reconfiguration (PR) regions in field-programmable gate array (FPGA) designs. After an initial static design and PR region are created with Xilinx's Vivado PR flow, the Maverick flow can then compile and configure RMs onto that PR region—without the use of vendor tools. Maverick builds upon existing open source tools (Yosys, RapidSmith2, and Project X-Ray) to form an end-to-end compilation flow. This paper describes the Maverick flow and shows the results of it running on a PYNQ-Z1's ARM processor to compile a set of HDL designs to partial bitstreams. The resulting bitstreams were configured onto the PYNQ-Z1's FPGA fabric, demonstrating the feasibility of a single-chip embedded system which can both compile HDL designs to bitstreams and then configure them onto its own programmable fabric.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128127421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
High Precision, High Performance FPGA Adders
M. Langhammer, B. Pasca, Gregg Baeckler
FPGAs are now commonly used in the datacenter as smart Network Interface Cards (NICs), with cryptography as one of the strategic application areas. Public-key cryptography algorithms in particular require arithmetic with thousands of bits of precision. Even an operation as simple as addition can be difficult for the FPGA when dealing with large integers, because of the high resource count and high latency needed to achieve usable performance levels with known methods. This paper examines the architecture and implementation of high-performance integer adders on FPGAs for widths ranging from 1024 to 8192 bits, in both single-instance and many-core chip-filling configurations. For chip-filling designs, the routing impact of these wide busses is assessed, as they often have an impact outside the immediate locality of the structures. The architectures presented in this work show a one to two order-of-magnitude reduction in the area-latency product over commonly used approaches. Routing congestion is managed, with near 100% logic efficiency (packing) for the adder function. Performance for these largely automatically placed designs is approximately the same as for carefully floor-planned non-arithmetic applications. In one example design, we show a 2048-bit adder in 5021 ALMs, with a latency of 6 clock cycles, running at 628 MHz in a Stratix 10 E-2 device.
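To ground the idea of breaking a very wide addition into shorter segments, here is a minimal software analogue: the operands are added in fixed-width chunks with a one-bit carry passed between them, mirroring how a long FPGA carry chain can be split into pipelined segments to sustain a high clock frequency. The function and chunk size are illustrative assumptions, not the paper's architecture.

```python
def wide_add(a: int, b: int, width: int = 2048, chunk: int = 256):
    # Add two width-bit unsigned integers chunk by chunk, propagating a
    # single carry bit between chunks, as a pipelined carry chain would.
    mask = (1 << chunk) - 1
    out = carry = shift = 0
    for _ in range(width // chunk):
        s = (a & mask) + (b & mask) + carry
        out |= (s & mask) << shift
        carry = s >> chunk
        a >>= chunk
        b >>= chunk
        shift += chunk
    return out, carry  # sum modulo 2**width, plus the final carry-out

# e.g. wide_add((1 << 2048) - 1, 1) returns (0, 1): all-ones plus one
# wraps to zero with a carry-out, matching full-width addition.
```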
{"title":"High Precision, High Performance FPGA Adders","authors":"M. Langhammer, B. Pasca, Gregg Baeckler","doi":"10.1109/FCCM.2019.00047","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00047","url":null,"abstract":"FPGAs are now being commonly used in the datacenter as smart Network Interface Cards (NICs), with cryptography as one of the strategic application areas. Public key cryptography algorithms in particular require arithmetic with thousands of bits of precision. Even an operation as simple as addition can be difficult for the FPGA when dealing with large integers, because of the high resource count and high latency needed to achieve usable performance levels with known methods. This paper examines the architecture and implementation of high-performance integer adders on FPGAs for widths ranging from 1024 to 8192 bits, in both single-instance and many-core chip-filling configurations. For chip-filling designs the routing impact of these wide busses are assessed, as they often have an impact outside the immediate locality of the structures. The architectures presented in this work show 1 to 2 orders magnitude reduction in the area-latency product over commonly used approaches. Routing congestion is managed, with near 100% logic efficiency (packing) for the adder function. Performance for these largely automatically placed designs are approximately the same as for carefully floor-planned non-arithmetic applications. In one example design, we show a 2048 bit adder in 5021 ALMs, with a latency of 6 clock cycles, at 628 MHz in a Stratix 10 E-2 device.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130468049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5