{"title":"Session details: Session 1: Architecture","authors":"S. Kaptanoglu","doi":"10.1145/3252936","DOIUrl":"https://doi.org/10.1145/3252936","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115082964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Placement is arguably the most critical step in the FPGA design flow. The demand for high performance continues to increase, yet existing placers still face numerous challenges, including very long runtimes, poor scalability, and restricted design-space exploration. In this paper we propose a novel timing-driven placement algorithm called BoxPlacer, built on the force-directed concept. BoxPlacer first uses a simple policy to create the initial box for placement. A force-directed iterative scheme then shrinks the box and determines the global placement. Finally, the same concept is employed to eliminate overlaps between the reduced boxes, ensuring legality in detailed placement. Timing drives every stage of BoxPlacer. We demonstrate the effectiveness of BoxPlacer by comparing our experimental results with those produced by an academic simulated-annealing-based placer. Notably, BoxPlacer achieves on average about an 8x runtime advantage together with 9% smaller critical-path delay and 6% shorter wirelength.
{"title":"BoxPlacer: Force Directed-Based Timing-Driven Placement for Large-Scale FPGAs: (Abstract Only)","authors":"Minghua Shen, Jiaxi Zhang, Nong Xiao, Guojie Luo","doi":"10.1145/3174243.3174977","DOIUrl":"https://doi.org/10.1145/3174243.3174977","url":null,"abstract":"Placement is probably the most critical process in the FPGA design flow. The demand for high performance continues to increase, but existing placers are still faced with numerous challenges including very long runtime, poor scalability, and restricted space exploration. In this paper we propose a novel timing-driven placement algorithm called BoxPlacer, which is supported by the force directed concept. BoxPlacer firstly uses a simple policy to create the initial box for placement. Then a force-directed iterative scheme is used to reduce the box size and determine the global placement. At last, the same concept is employed to eliminate the overlaps between reduced boxes to ensure the legalization in detailed placement. Notice that timing is always used to drive the placement in BoxPlacer. We demonstrate the effectiveness of our BoxPlacer by comparing the experimental results with that produced by the academic simulated annealing-based placer. Notably, our BoxPlacer achieves on average about 8x runtime advantage with 9% smaller critical path delay and 6% shorter wirelength.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122761093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconfigurable logic devices such as FPGAs are well known as drivers of cutting-edge device technology. In the last five years, there have been extensive studies on constructing novel FPGA devices using CMOS technology combined with emerging spintronic devices. Unfortunately, although spintronic device technology promises desirable features such as non-volatility and high area density, its relatively slow switching speed makes it quite challenging to use these devices as drop-in replacements for CMOS transistors. As such, to fully unlock the performance benefits of spintronic devices, it is imperative to develop innovative circuit and architecture design techniques that are custom-made for building high-performance FPGA devices. In this paper, we aim to fully extract the benefits of new spin-based device technology through innovative circuit and architecture design techniques for FPGAs. Specifically, we exploit the unique characteristics of a domain-wall logic device called the mCell to achieve a direct mapping to NAND-NOR logic and, in doing so, create a high-throughput non-volatile alternative to LUT-based CMOS reconfigurable logic. To empirically validate our approach, we have performed extensive HSPICE circuit simulations. Our simulation results show that, for a similar logic capacity, the NAND-NOR FPGA design with mCell devices excels across all metrics when compared to the CMOS NAND-NOR FPGA design. Not only do we reduce average delay by about 17%, but we also improve path-delay variance between different logic block configurations by about 59%, which can ease the burden on FPGA timing-analysis CAD tools by providing more consistent delay between configurations. To judge the performance of our mCell FPGA in practical applications, we measured it against the Stratix IV LUT-based FPGA on the MCNC and VTR benchmark suites. Our mCell-based FPGA devices prove to be quite competitive against the CMOS LUT-based FPGA design, on average reducing delay and area by approximately 26% and 64% for the MCNC benchmarks, and 13% and 55% for the VTR benchmarks, respectively.
{"title":"Architecture and Circuit Design of an All-Spintronic FPGA","authors":"Stephen M. Williams, Mingjie Lin","doi":"10.1145/3174243.3174256","DOIUrl":"https://doi.org/10.1145/3174243.3174256","url":null,"abstract":"Reconfigurable logic device, such as FPGA, has been well-known to be the driver of cutting-edge device technology. In the last five years, there have been extensive studies on constructing novel FPGA devices using CMOS technology combined with emerging spin- tronic devices. Unfortunately, although spintronic device technol- ogy promises desirable features such as non-volatility and high area density, its relatively slow switching speed makes it quite chal- lenging to use them as drop-in replacements for CMOS transistors. As such, to fully unlock the performance benefits of spintronic de- vices, it is imperative to develop innovative design techniques of circuit and architecture that are custom-made for building high- performance FPGA devices. In this paper, we aim at fully extracting the benefits of new spin-based device technology through innovative circuit and architecture design techniques for FPGAs. Specifically, we exploit the unique characteristics of a domain-wall logic device called the mCell to achieve a direct mapping to NAND-NOR logic and in doing so create a high-throughput non-volatile alternative to LUT-based CMOS reconfigurable logic. To empirically validate our approach, we have performed extensive HSpice circuit simulations. Our simulation results have shown that, for a similar logic capacity, the NAND-NOR FPGA design with mCell devices excels across all metrics when compared to the CMOS NAND-NOR FPGA design. Not only do we reduce average delay by about 17%, but we also improve path delay variance between different logic block configurations by about 59%, which can ease the burden on the FPGA timing analysis CAD tools by having more consistent delay between configurations. To judge the performance of our mCell FPGA in practical applications, we measured it against the Stratix IV LUT-based FPGA for the MCNC and VTR benchmark suites. Our mCell-based FPGA devices prove to be quite competitive against the CMOS LUT-based FPGA design, on average reducing delay and area by approximately 26% and 64% for the MCNC benchmark, and 13% and 55% for the VTR benchmark respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114042700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hiroki Nakahara, H. Yonekawa, Tomoya Fujii, Shimpei Sato
Frame object detection consists of two problems: regression to spatially separated bounding boxes, and the associated classification of the objects, both within a real-time frame rate. It is widely used in embedded systems, such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. This paper implements the YOLO (You Only Look Once) object detector on an FPGA, which is faster and more accurate. It is based on a deep convolutional neural network (CNN), which dominates both the performance and the area. However, an object detector based on a CNN consists of bounding-box prediction (regression) and class estimation (classification). Thus, a conventional fully binarized CNN fails to recognize objects in most cases. In this paper, we propose a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization. To our knowledge, this is the first time binarized CNNs have been successfully used in object detection. We implement a pipelined architecture for the lightweight YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, it dissipated 1.1 times more power, and its performance per watt was 158.9 times better. Also, compared with the NVIDIA Pascal embedded GPU, it was 27.5 times faster, it dissipated 1.5 times less power, and its performance per watt was 42.9 times better. Thus, our method is suitable as a frame object detector for an embedded vision system.
{"title":"A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA","authors":"Hiroki Nakahara, H. Yonekawa, Tomoya Fujii, Shimpei Sato","doi":"10.1145/3174243.3174266","DOIUrl":"https://doi.org/10.1145/3174243.3174266","url":null,"abstract":"A frame object detection problem consists of two problems: one is a regression problem to spatially separated bounding boxes, the second is the associated classification of the objects within realtime frame rate. It is widely used in the embedded systems, such as robotics, autonomous driving, security, and drones - all of which require high-performance and low-power consumption. This paper implements the YOLO (You only look once) object detector on an FPGA, which is faster and has a higher accuracy. It is based on the convolutional deep neural network (CNN), and it is a dominant part both the performance and the area. However, the object detector based on the CNN consists of a bounding box prediction (regression) and a class estimation (classification). Thus, the conventional all binarized CNN fails to recognize in most cases. In the paper, we propose a lightweight YOLOv2, which consists of the binarized CNN for a feature extraction and the parallel support vector regression (SVR) for both a classification and a localization. To our knowledge, this is the first time binarized CNN»s have been successfully used in object detection. We implement a pipelined based architecture for the lightweight YOLOv2 on the Xilinx Inc. zcu102 board, which has the Xilinx Inc. Zynq Ultrascale+ MPSoC. The implemented object detector archived 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, it dissipated 1.1 times more power, and its performance per power efficiency was 158.9 times better. Also, compared with the nVidia Pascall embedded GPU, it was 27.5 times faster, it dissipated 1.5 times lower power, and its performance per power efficiency was 42.9 times better. Thus, our method is suitable for the frame object detector for an embedded vision system.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128455771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duncan J. M. Moss, Krishnan Srivatsan, E. Nurvitadhi, P. Ratuszniak, Chris Johnson, Jaewoong Sim, Asit K. Mishra, Debbie Marr, S. Subhaschandra, P. Leong
General matrix-to-matrix multiplication (GEMM) is the cornerstone of a wide gamut of applications in high-performance computing (HPC), scientific computing (SC) and, more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single-precision floating-point and reduced-precision workloads. Our framework supports arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching; interleaving and block-size control; and fused deep-learning-specific operations. The framework currently supports single-precision floating point (FP32); 16-, 8-, 4- and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced-precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to a 50x improvement in execution time compared to single-precision floating point, and runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.
{"title":"A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study","authors":"Duncan J. M. Moss, Krishnan Srivatsan, E. Nurvitadhi, P. Ratuszniak, Chris Johnson, Jaewoong Sim, Asit K. Mishra, Debbie Marr, S. Subhaschandra, P. Leong","doi":"10.1145/3174243.3174258","DOIUrl":"https://doi.org/10.1145/3174243.3174258","url":null,"abstract":"General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads. Our framework supports arbitrary size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile and runtime options for controlling key aspects of the hardware template including dynamic precision switching; interleaving and block size control; and fused deep learning specific operations. The framework currently supports single precision floating point (FP32), 16, 8, 4 and 2 bit Integer and Fixed Point (INT16, INT8, INT4, INT2) and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and FPGA. We observe up to 50x improvement in execution time compared to single precision floating point, and that runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance on software-programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources which, in combination with distributed memories and the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping of a new QRD algorithm to parallel structures with inter-vector connectivity. Based on the Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the best previously known implementations. Mapped to an Intel Arria 10 device, we achieve 80 us for a 256x256 single-precision real matrix, equivalent to 417 GFLOP/s. This corresponds to a 95% sustained-to-peak ratio for the portion of the device used in this work.
{"title":"High-Performance QR Decomposition for FPGAs","authors":"M. Langhammer, B. Pasca","doi":"10.1145/3174243.3174273","DOIUrl":"https://doi.org/10.1145/3174243.3174273","url":null,"abstract":"QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance in software programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources, and in combination with distributed memories, as well as the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping to parallel structures with inter-vector connectivity of a new QRD algorithm. Based on a Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the previously best known implementations. Mapped to an Intel Arria 10 device, we achieve 80us for a 256x256 single precision real matrix, for a 417 GFLOP equivalent. This corresponds to a 95% sustained to peak ratio, for the portion of the device used for this work.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126592228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting similarities between sequences is an important part of bioinformatics. In this poster, we explore the use of a high-level synthesis tool and a field-programmable gate array (FPGA) to optimize a sequence alignment algorithm. We demonstrate optimization techniques that improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying these optimizations to the algorithm using the Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms while the power consumption is approximately 11 watts on the ADM-PCIE-8K5 FPGA platform.
{"title":"Optimizations of Sequence Alignment on FPGA: A Case Study of Extended Sequence Alignment (Abstact Only)","authors":"Zheming Jin, Kazutomo Yoshii","doi":"10.1145/3174243.3174958","DOIUrl":"https://doi.org/10.1145/3174243.3174958","url":null,"abstract":"Detecting similarities between sequences is an important part of Bioinformatics. In this poster, we explore the use of high-level synthesis tool and a field-programmable gate array (FPGA) for optimizing a sequence alignment algorithm. We demonstrate the optimization techniques to improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying the optimizations to the algorithm using Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms while the power consumption is approximately 11 Watts on the ADM-PCIE-8K5 FPGA platform.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131067329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chongchong Xu, Chao Wang, Yiwei Zhang, Lei Gong, Xi Li, Xuehai Zhou
Large-scale graph processing, which draws the attention of researchers, applies to a wide range of domains, such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general-purpose processors suffers from computation and memory inefficiency. Therefore, research on hardware accelerators for graph processing has recently become a hot topic. Meanwhile, as a power-efficient and reconfigurable resource, the FPGA is a potential platform on which to design and deploy graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most graph algorithms, such as Breadth-First Search, Depth-First Search, and Single-Source Shortest Path. Domino also proposes a specific row-vector-based data structure, named Batch Row Vector, to represent graphs. Our work adopts a naive update mechanism and a bisect update mechanism to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino delivers significant performance and energy improvements, especially for graphs with a large diameter (e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedups for small-diameter graphs (e.g., com-youtube, WikiTalk, and soc-LiveJournal) over GraphChi on the Intel Core2 and Core i7 processors, respectively. Besides, compared to the Intel Core i7 processor, Domino also achieves significant energy efficiency: 2.03x-10.08x for three small-diameter graphs and 27.98x-134.50x for roadNet-CA, a graph with a relatively large diameter.
{"title":"Domino: An Asynchronous and Energy-efficient Accelerator for Graph Processing: (Abstract Only)","authors":"Chongchong Xu, Chao Wang, Yiwei Zhang, Lei Gong, Xi Li, Xuehai Zhou","doi":"10.1145/3174243.3174973","DOIUrl":"https://doi.org/10.1145/3174243.3174973","url":null,"abstract":"Large-scale graphs processing, which draws attentions of researchers, applies in a large range of domains, such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general processors suffers from difficulties including computation and memory inefficiency. Therefore, the research of hardware accelerator for graph processing has become a hot issue recently. Meanwhile, as a power-efficiency and reconfigurable resource, FPGA is a potential solution to design and employ graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most of the graph algorithms, such as Breadth-First Search, Depth-First Search, and Single Source Shortest Path. Domino also proposes a specific data structure based on row vector, named Batch Row Vector, to present graphs. Our work adopts the naive update mechanism and bisect update mechanism to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino has significant performance and energy improvement, especially for graphs with a large diameter(e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedup for small-diameter graphs(e.g., com-youtube, WikiTalk, and soc-LiveJournal), over GraphChi on the Intel Core2 and Core i7 processors, respectively. Besides, compared to Intel Core i7 processors, Domino also performs significant energy-efficiency that is 2.03x-10.08x for three small-diameter graphs and 27.98x-134.50x for roadNet-CA which is a graph with relatively large diameter.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123988814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The capacity of IEEE 802.11p communication in vehicular ad hoc networks (VANETs) is highly sensitive to the tradeoff between the control channel (CCH) and the service channels (SCHs), which is particularly evident under different traffic flow conditions. This paper proposes a hybrid multichannel scheduling algorithm on FPGA with traffic flow forecasting based on the Kalman filter (HMS-FFK), following the extended SCH access mechanism described in the IEEE 1609.4 protocol. In HMS-FFK, a Random CCH Transmission Request Probability is defined to describe the CCH message congestion probability according to the local traffic flow density. A hardware prototype of the MAC sublayer management entity (MLME) based on HMS-FFK scheduling (MLME-HMS) is then designed on FPGA, which can be flexibly integrated into an 802.11p communication system through the PCI interface. Theoretical analysis and simulation results show that the proposed scheme and the MLME hardware prototype help the IEEE 1609.4 MAC optimize SCH throughput and reduce CCH transmission delay under different traffic flow conditions.
{"title":"Software/Hardware Co-design for Multichannel Scheduling in IEEE 802.11p MLME: (Abstract Only)","authors":"N. Ding, Wei Zhang, Yanhua Ma, Zhen-guo Gao","doi":"10.1145/3174243.3174971","DOIUrl":"https://doi.org/10.1145/3174243.3174971","url":null,"abstract":"The capacity of IEEE 802.11p communication in vehicular ad hoc networks (VANETs) is widely sensitive to the tradeoff between control channel (CCH) and service channels (SCHs), which is particularly obvious in the different traffic flow condition. This paper proposes a hybrid multichannel scheduling algorithm with FPGA and traffic flow forecasting based on Kalman Filter (HMS-FFK) according to the extended SCH access mechanism mentioned in IEEE 1609.4 protocol. In HMS-FFK, a Random CCH Transmission Request Probability is defined to describe the CCH message congestion probability according to the local traffic flow density. Then, a hardware prototype of MAC sublayer management entities (MLME) based on HMS-FFK scheduling (MLME-HMS) is designed with FPGA, which is flexible to be integrated in the 802.11p communication system by the PCI interface. Theoretical analysis and simulation results show that the proposed scheme and hardware prototype of MLME are able to help IEEE 1609.4 MAC to optimize the throughput of SCHs and reduce the transmission delay of CCH in the different traffic flow condition.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114675286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li
Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to the social sciences. However, due to the irregular data access patterns in graph computations, one major challenge for graph processing systems is performance. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are generally not effective for the massive, sparse graphs of real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random-access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth-First Search (BFS) on an FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform's capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory-request merging unit that takes advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve a 2.8x average performance improvement over the latest FPGA-HMC-based graph processing system on a set of benchmarks from a wide range of applications.
{"title":"Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform","authors":"Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li","doi":"10.1145/3174243.3174260","DOIUrl":"https://doi.org/10.1145/3174243.3174260","url":null,"abstract":"Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to social sciences. However, due to the irregular data access patterns in graph computations, one major challenge for graph processing systems is performance. The algorithms, softwares, and hardwares that have been tailored for mainstream parallel applications are generally not effective for massive, sparse graphs from the real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth First Search (BFS) on a FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform»s capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory request merging unit to take advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve $2.8 times$ average performance improvement compared to the latest FPGA-HMC based graph processing system over a set of benchmarks from a wide range of applications.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114965860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}