{"title":"Session details: Session 1: Architecture","authors":"S. Kaptanoglu","doi":"10.1145/3252936","DOIUrl":"https://doi.org/10.1145/3252936","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115082964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Placement is arguably the most critical step in the FPGA design flow. The demand for high performance continues to increase, yet existing placers still face numerous challenges, including very long runtimes, poor scalability, and restricted design-space exploration. In this paper we propose a novel timing-driven placement algorithm called BoxPlacer, built on the force-directed concept. BoxPlacer first uses a simple policy to create the initial box for placement. A force-directed iterative scheme then shrinks the box and determines the global placement. Finally, the same concept is employed to eliminate overlaps between the reduced boxes, ensuring legality in detailed placement. Timing drives every stage of BoxPlacer. We demonstrate the effectiveness of BoxPlacer by comparing our experimental results with those produced by an academic simulated-annealing-based placer. Notably, BoxPlacer achieves on average about an 8x runtime advantage together with 9% smaller critical-path delay and 6% shorter wirelength.
{"title":"BoxPlacer: Force Directed-Based Timing-Driven Placement for Large-Scale FPGAs: (Abstract Only)","authors":"Minghua Shen, Jiaxi Zhang, Nong Xiao, Guojie Luo","doi":"10.1145/3174243.3174977","DOIUrl":"https://doi.org/10.1145/3174243.3174977","url":null,"abstract":"Placement is probably the most critical process in the FPGA design flow. The demand for high performance continues to increase, but existing placers are still faced with numerous challenges including very long runtime, poor scalability, and restricted space exploration. In this paper we propose a novel timing-driven placement algorithm called BoxPlacer, which is supported by the force directed concept. BoxPlacer firstly uses a simple policy to create the initial box for placement. Then a force-directed iterative scheme is used to reduce the box size and determine the global placement. At last, the same concept is employed to eliminate the overlaps between reduced boxes to ensure the legalization in detailed placement. Notice that timing is always used to drive the placement in BoxPlacer. We demonstrate the effectiveness of our BoxPlacer by comparing the experimental results with that produced by the academic simulated annealing-based placer. Notably, our BoxPlacer achieves on average about 8x runtime advantage with 9% smaller critical path delay and 6% shorter wirelength.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122761093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconfigurable logic devices such as FPGAs are well known as drivers of cutting-edge device technology. In the last five years, there have been extensive studies on constructing novel FPGA devices using CMOS technology combined with emerging spintronic devices. Unfortunately, although spintronic device technology promises desirable features such as non-volatility and high area density, its relatively slow switching speed makes it quite challenging to use these devices as drop-in replacements for CMOS transistors. As such, to fully unlock the performance benefits of spintronic devices, it is imperative to develop innovative circuit and architecture design techniques that are custom-made for building high-performance FPGA devices. In this paper, we aim to fully extract the benefits of new spin-based device technology through innovative circuit and architecture design techniques for FPGAs. Specifically, we exploit the unique characteristics of a domain-wall logic device called the mCell to achieve a direct mapping to NAND-NOR logic and, in doing so, create a high-throughput non-volatile alternative to LUT-based CMOS reconfigurable logic. To empirically validate our approach, we have performed extensive HSPICE circuit simulations. Our simulation results show that, for a similar logic capacity, the NAND-NOR FPGA design with mCell devices excels across all metrics when compared to the CMOS NAND-NOR FPGA design. Not only do we reduce average delay by about 17%, but we also improve path-delay variance between different logic block configurations by about 59%, which can ease the burden on FPGA timing-analysis CAD tools by providing more consistent delay between configurations. To judge the performance of our mCell FPGA in practical applications, we measured it against the Stratix IV LUT-based FPGA on the MCNC and VTR benchmark suites. Our mCell-based FPGA devices prove to be quite competitive against the CMOS LUT-based FPGA design, on average reducing delay and area by approximately 26% and 64% for the MCNC benchmarks, and 13% and 55% for the VTR benchmarks, respectively.
{"title":"Architecture and Circuit Design of an All-Spintronic FPGA","authors":"Stephen M. Williams, Mingjie Lin","doi":"10.1145/3174243.3174256","DOIUrl":"https://doi.org/10.1145/3174243.3174256","url":null,"abstract":"Reconfigurable logic device, such as FPGA, has been well-known to be the driver of cutting-edge device technology. In the last five years, there have been extensive studies on constructing novel FPGA devices using CMOS technology combined with emerging spin- tronic devices. Unfortunately, although spintronic device technol- ogy promises desirable features such as non-volatility and high area density, its relatively slow switching speed makes it quite chal- lenging to use them as drop-in replacements for CMOS transistors. As such, to fully unlock the performance benefits of spintronic de- vices, it is imperative to develop innovative design techniques of circuit and architecture that are custom-made for building high- performance FPGA devices. In this paper, we aim at fully extracting the benefits of new spin-based device technology through innovative circuit and architecture design techniques for FPGAs. Specifically, we exploit the unique characteristics of a domain-wall logic device called the mCell to achieve a direct mapping to NAND-NOR logic and in doing so create a high-throughput non-volatile alternative to LUT-based CMOS reconfigurable logic. To empirically validate our approach, we have performed extensive HSpice circuit simulations. Our simulation results have shown that, for a similar logic capacity, the NAND-NOR FPGA design with mCell devices excels across all metrics when compared to the CMOS NAND-NOR FPGA design. Not only do we reduce average delay by about 17%, but we also improve path delay variance between different logic block configurations by about 59%, which can ease the burden on the FPGA timing analysis CAD tools by having more consistent delay between configurations. To judge the performance of our mCell FPGA in practical applications, we measured it against the Stratix IV LUT-based FPGA for the MCNC and VTR benchmark suites. Our mCell-based FPGA devices prove to be quite competitive against the CMOS LUT-based FPGA design, on average reducing delay and area by approximately 26% and 64% for the MCNC benchmark, and 13% and 55% for the VTR benchmark respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114042700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hiroki Nakahara, H. Yonekawa, Tomoya Fujii, Shimpei Sato
Frame object detection consists of two problems: regression to spatially separated bounding boxes, and the associated classification of the objects, both within a real-time frame rate. It is widely used in embedded systems, such as robotics, autonomous driving, security, and drones, all of which require high performance and low power consumption. This paper implements the YOLO (You Only Look Once) object detector on an FPGA, which is faster and more accurate. It is based on a deep convolutional neural network (CNN), which dominates both the performance and the area. However, an object detector based on a CNN consists of bounding-box prediction (regression) and class estimation (classification). Thus, a conventional fully binarized CNN fails to recognize objects in most cases. In this paper, we propose a lightweight YOLOv2, which consists of a binarized CNN for feature extraction and parallel support vector regression (SVR) for both classification and localization. To our knowledge, this is the first time binarized CNNs have been successfully used in object detection. We implement a pipelined architecture for the lightweight YOLOv2 on the Xilinx ZCU102 board, which carries a Xilinx Zynq UltraScale+ MPSoC. The implemented object detector achieved 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, it dissipated 1.1 times more power, and its performance per watt was 158.9 times better. Also, compared with the NVIDIA Pascal embedded GPU, it was 27.5 times faster, it dissipated 1.5 times less power, and its performance per watt was 42.9 times better. Thus, our method is suitable as a frame object detector for an embedded vision system.
{"title":"A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an FPGA","authors":"Hiroki Nakahara, H. Yonekawa, Tomoya Fujii, Shimpei Sato","doi":"10.1145/3174243.3174266","DOIUrl":"https://doi.org/10.1145/3174243.3174266","url":null,"abstract":"A frame object detection problem consists of two problems: one is a regression problem to spatially separated bounding boxes, the second is the associated classification of the objects within realtime frame rate. It is widely used in the embedded systems, such as robotics, autonomous driving, security, and drones - all of which require high-performance and low-power consumption. This paper implements the YOLO (You only look once) object detector on an FPGA, which is faster and has a higher accuracy. It is based on the convolutional deep neural network (CNN), and it is a dominant part both the performance and the area. However, the object detector based on the CNN consists of a bounding box prediction (regression) and a class estimation (classification). Thus, the conventional all binarized CNN fails to recognize in most cases. In the paper, we propose a lightweight YOLOv2, which consists of the binarized CNN for a feature extraction and the parallel support vector regression (SVR) for both a classification and a localization. To our knowledge, this is the first time binarized CNN»s have been successfully used in object detection. We implement a pipelined based architecture for the lightweight YOLOv2 on the Xilinx Inc. zcu102 board, which has the Xilinx Inc. Zynq Ultrascale+ MPSoC. The implemented object detector archived 40.81 frames per second (FPS). Compared with the ARM Cortex-A57, it was 177.4 times faster, it dissipated 1.1 times more power, and its performance per power efficiency was 158.9 times better. Also, compared with the nVidia Pascall embedded GPU, it was 27.5 times faster, it dissipated 1.5 times lower power, and its performance per power efficiency was 42.9 times better. Thus, our method is suitable for the frame object detector for an embedded vision system.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128455771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Duncan J. M. Moss, Krishnan Srivatsan, E. Nurvitadhi, P. Ratuszniak, Chris Johnson, Jaewoong Sim, Asit K. Mishra, Debbie Marr, S. Subhaschandra, P. Leong
General matrix-to-matrix multiplication (GEMM) is the cornerstone of a wide gamut of applications in high-performance computing (HPC), scientific computing (SC) and, more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that supports both traditional single-precision floating-point and reduced-precision workloads. Our framework supports arbitrary-size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile-time and runtime options for controlling key aspects of the hardware template, including dynamic precision switching; interleaving and block-size control; and fused deep-learning-specific operations. The framework currently supports single-precision floating point (FP32); 16-, 8-, 4- and 2-bit integer and fixed point (INT16, INT8, INT4, INT2); and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced-precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and the FPGA. We observe up to a 50x improvement in execution time compared to single-precision floating point, and runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.
{"title":"A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform: A Deep Learning Case Study","authors":"Duncan J. M. Moss, Krishnan Srivatsan, E. Nurvitadhi, P. Ratuszniak, Chris Johnson, Jaewoong Sim, Asit K. Mishra, Debbie Marr, S. Subhaschandra, P. Leong","doi":"10.1145/3174243.3174258","DOIUrl":"https://doi.org/10.1145/3174243.3174258","url":null,"abstract":"General Matrix to Matrix multiplication (GEMM) is the cornerstone for a wide gamut of applications in high performance computing (HPC), scientific computing (SC) and more recently, deep learning. In this work, we present a customizable matrix multiplication framework for the Intel HARPv2 CPU+FPGA platform that includes support for both traditional single precision floating point and reduced precision workloads. Our framework supports arbitrary size GEMMs and consists of two parts: (1) a simple application programming interface (API) for easy configuration and integration into existing software and (2) a highly customizable hardware template. The API provides both compile and runtime options for controlling key aspects of the hardware template including dynamic precision switching; interleaving and block size control; and fused deep learning specific operations. The framework currently supports single precision floating point (FP32), 16, 8, 4 and 2 bit Integer and Fixed Point (INT16, INT8, INT4, INT2) and more exotic data types for deep learning workloads: INT16xTernary, INT8xTernary, BinaryxBinary. We compare our implementation to the latest NVIDIA Pascal GPU and evaluate the performance benefits provided by optimizations built into the hardware template. Using three neural networks (AlexNet, VGGNet and ResNet) we illustrate that reduced precision representations such as binary achieve the best performance, and that the HARPv2 enables fine-grained partitioning of computations over both the Xeon and FPGA. We observe up to 50x improvement in execution time compared to single precision floating point, and that runtime configuration options can improve the efficiency of certain layers in AlexNet up to 4x, achieving an overall 1.3x improvement over the entire network.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance on software-programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources which, in combination with distributed memories and the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping of a new QRD algorithm to parallel structures with inter-vector connectivity. Based on the Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the best previously known implementations. Mapped to an Intel Arria 10 device, we achieve 80 us for a 256x256 single-precision real matrix, equivalent to 417 GFLOP/s. This corresponds to a 95% sustained-to-peak ratio for the portion of the device used in this work.
{"title":"High-Performance QR Decomposition for FPGAs","authors":"M. Langhammer, B. Pasca","doi":"10.1145/3174243.3174273","DOIUrl":"https://doi.org/10.1145/3174243.3174273","url":null,"abstract":"QR decomposition (QRD) is of increasing importance for many current applications, such as wireless and radar. Data dependencies in known algorithms and approaches, combined with the data access patterns used in many of these methods, restrict the achievable performance in software programmable targets. Some FPGA architectures now incorporate hard floating-point (HFP) resources, and in combination with distributed memories, as well as the flexibility of internal connectivity, can support high-performance matrix arithmetic. In this work, we present the mapping to parallel structures with inter-vector connectivity of a new QRD algorithm. Based on a Modified Gram-Schmidt (MGS) algorithm, this new algorithm has a different loop organization, but the dependent functional sequences are unchanged, so error analysis and numerical stability are unaffected. This work has a theoretical sustained-to-peak performance close to 100% for large matrices, which is roughly three times the functional density of the previously best known implementations. Mapped to an Intel Arria 10 device, we achieve 80us for a 256x256 single precision real matrix, for a 417 GFLOP equivalent. This corresponds to a 95% sustained to peak ratio, for the portion of the device used for this work.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126592228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting similarities between sequences is an important part of bioinformatics. In this poster, we explore the use of a high-level synthesis tool and a field-programmable gate array (FPGA) to optimize a sequence alignment algorithm. We demonstrate optimization techniques that improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying these optimizations to the algorithm using the Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms while the power consumption is approximately 11 watts on the ADM-PCIE-8K5 FPGA platform.
{"title":"Optimizations of Sequence Alignment on FPGA: A Case Study of Extended Sequence Alignment (Abstact Only)","authors":"Zheming Jin, Kazutomo Yoshii","doi":"10.1145/3174243.3174958","DOIUrl":"https://doi.org/10.1145/3174243.3174958","url":null,"abstract":"Detecting similarities between sequences is an important part of Bioinformatics. In this poster, we explore the use of high-level synthesis tool and a field-programmable gate array (FPGA) for optimizing a sequence alignment algorithm. We demonstrate the optimization techniques to improve the performance of the extended sequence alignment algorithm in the BWA software package, a tool for mapping DNA sequences against a large reference sequence. Applying the optimizations to the algorithm using Xilinx SDAccel OpenCL-to-FPGA tool, we reduce the kernel execution time from 62.8 ms to 0.45 ms while the power consumption is approximately 11 Watts on the ADM-PCIE-8K5 FPGA platform.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131067329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chongchong Xu, Chao Wang, Yiwei Zhang, Lei Gong, Xi Li, Xuehai Zhou
Large-scale graph processing, which draws the attention of researchers, applies to a wide range of domains, such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general-purpose processors suffers from computation and memory inefficiency. Therefore, research on hardware accelerators for graph processing has recently become a hot topic. Meanwhile, as a power-efficient and reconfigurable resource, the FPGA is a potential platform on which to design and deploy graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most graph algorithms, such as Breadth-First Search, Depth-First Search, and Single-Source Shortest Path. Domino also proposes a specific row-vector-based data structure, named Batch Row Vector, to represent graphs. Our work adopts a naive update mechanism and a bisect update mechanism to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino delivers significant performance and energy improvements, especially for graphs with a large diameter (e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedups for small-diameter graphs (e.g., com-youtube, WikiTalk, and soc-LiveJournal) over GraphChi on the Intel Core2 and Core i7 processors, respectively. Besides, compared to the Intel Core i7 processor, Domino also achieves significant energy efficiency: 2.03x-10.08x for three small-diameter graphs and 27.98x-134.50x for roadNet-CA, a graph with a relatively large diameter.
{"title":"Domino: An Asynchronous and Energy-efficient Accelerator for Graph Processing: (Abstract Only)","authors":"Chongchong Xu, Chao Wang, Yiwei Zhang, Lei Gong, Xi Li, Xuehai Zhou","doi":"10.1145/3174243.3174973","DOIUrl":"https://doi.org/10.1145/3174243.3174973","url":null,"abstract":"Large-scale graphs processing, which draws attentions of researchers, applies in a large range of domains, such as social networks, web graphs, and transport networks. However, processing large-scale graphs on general processors suffers from difficulties including computation and memory inefficiency. Therefore, the research of hardware accelerator for graph processing has become a hot issue recently. Meanwhile, as a power-efficiency and reconfigurable resource, FPGA is a potential solution to design and employ graph processing algorithms. In this paper, we propose Domino, an asynchronous and energy-efficient hardware accelerator for graph processing. Domino adopts the asynchronous model to process graphs, which is efficient for most of the graph algorithms, such as Breadth-First Search, Depth-First Search, and Single Source Shortest Path. Domino also proposes a specific data structure based on row vector, named Batch Row Vector, to present graphs. Our work adopts the naive update mechanism and bisect update mechanism to perform asynchronous control. Ultimately, we implement Domino on an advanced Xilinx Virtex-7 board, and experimental results demonstrate that Domino has significant performance and energy improvement, especially for graphs with a large diameter(e.g., roadNet-CA and USA-Road). Case studies in Domino achieve 1.47x-7.84x and 0.47x-2.52x average speedup for small-diameter graphs(e.g., com-youtube, WikiTalk, and soc-LiveJournal), over GraphChi on the Intel Core2 and Core i7 processors, respectively. Besides, compared to Intel Core i7 processors, Domino also performs significant energy-efficiency that is 2.03x-10.08x for three small-diameter graphs and 27.98x-134.50x for roadNet-CA which is a graph with relatively large diameter.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123988814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The capacity of IEEE 802.11p communication in vehicular ad hoc networks (VANETs) is highly sensitive to the tradeoff between the control channel (CCH) and the service channels (SCHs), which is particularly evident under different traffic flow conditions. This paper proposes a hybrid multichannel scheduling algorithm on FPGA with traffic flow forecasting based on the Kalman filter (HMS-FFK), following the extended SCH access mechanism described in the IEEE 1609.4 protocol. In HMS-FFK, a Random CCH Transmission Request Probability is defined to describe the CCH message congestion probability according to the local traffic flow density. A hardware prototype of the MAC sublayer management entity (MLME) based on HMS-FFK scheduling (MLME-HMS) is then designed on FPGA, which can be flexibly integrated into an 802.11p communication system through the PCI interface. Theoretical analysis and simulation results show that the proposed scheme and the MLME hardware prototype help the IEEE 1609.4 MAC optimize SCH throughput and reduce CCH transmission delay under different traffic flow conditions.
{"title":"Software/Hardware Co-design for Multichannel Scheduling in IEEE 802.11p MLME: (Abstract Only)","authors":"N. Ding, Wei Zhang, Yanhua Ma, Zhen-guo Gao","doi":"10.1145/3174243.3174971","DOIUrl":"https://doi.org/10.1145/3174243.3174971","url":null,"abstract":"The capacity of IEEE 802.11p communication in vehicular ad hoc networks (VANETs) is widely sensitive to the tradeoff between control channel (CCH) and service channels (SCHs), which is particularly obvious in the different traffic flow condition. This paper proposes a hybrid multichannel scheduling algorithm with FPGA and traffic flow forecasting based on Kalman Filter (HMS-FFK) according to the extended SCH access mechanism mentioned in IEEE 1609.4 protocol. In HMS-FFK, a Random CCH Transmission Request Probability is defined to describe the CCH message congestion probability according to the local traffic flow density. Then, a hardware prototype of MAC sublayer management entities (MLME) based on HMS-FFK scheduling (MLME-HMS) is designed with FPGA, which is flexible to be integrated in the 802.11p communication system by the PCI interface. Theoretical analysis and simulation results show that the proposed scheme and hardware prototype of MLME are able to help IEEE 1609.4 MAC to optimize the throughput of SCHs and reduce the transmission delay of CCH in the different traffic flow condition.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114675286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li
Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to the social sciences. However, due to the irregular data access patterns in graph computations, one major challenge for graph processing systems is performance. The algorithms, software, and hardware that have been tailored for mainstream parallel applications are generally not effective for the massive, sparse graphs of real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random-access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth-First Search (BFS) on an FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform's capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory-request merging unit that takes advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve a 2.8x average performance improvement over the latest FPGA-HMC-based graph processing system on a set of benchmarks from a wide range of applications.
{"title":"Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform","authors":"Soroosh Khoram, Jialiang Zhang, Maxwell Strange, J. Li","doi":"10.1145/3174243.3174260","DOIUrl":"https://doi.org/10.1145/3174243.3174260","url":null,"abstract":"Graph analytics, which explores the relationships among interconnected entities, is becoming increasingly important due to its broad applicability, from machine learning to social sciences. However, due to the irregular data access patterns in graph computations, one major challenge for graph processing systems is performance. The algorithms, softwares, and hardwares that have been tailored for mainstream parallel applications are generally not effective for massive, sparse graphs from the real-world problems, due to their complex and irregular structures. To address the performance issues in large-scale graph analytics, we leverage the exceptional random access performance of the emerging Hybrid Memory Cube (HMC) combined with the flexibility and efficiency of modern FPGAs. In particular, we develop a collaborative software/hardware technique to perform a level-synchronized Breadth First Search (BFS) on a FPGA-HMC platform. From the software perspective, we develop an architecture-aware graph clustering algorithm that exploits the FPGA-HMC platform»s capability to improve data locality and memory access efficiency. From the hardware perspective, we further improve the FPGA-HMC graph processor architecture by designing a memory request merging unit to take advantage of the increased data locality resulting from graph clustering. We evaluate the performance of our BFS implementation using the AC-510 development kit from Micron and achieve $2.8 times$ average performance improvement compared to the latest FPGA-HMC based graph processing system over a set of benchmarks from a wide range of applications.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114965860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}