
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstract Only)
Ruizhe Zhao, Xinyu Niu, W. Luk
Convolution layers in Convolutional Neural Networks (CNNs) are effective in vision feature extraction but quite inefficient in computational resource usage. The depthwise separable convolution layer has been proposed in recent publications to enhance efficiency without reducing effectiveness, by separately computing the spatial and cross-channel correlations from input images, and has proven successful in state-of-the-art networks such as MobileNets [1] and Xception [2]. Based on the fact that depthwise separable convolution is highly structured and uses limited resources, we argue that it is well suited to reconfigurable platforms such as FPGAs. To bring this new layer to FPGA platforms, in this paper we present a novel framework that can automatically generate and optimise hardware designs for depthwise separable CNNs. In addition, within our framework, existing conventional CNNs can be systematically converted into ones whose standard convolution layers are selectively replaced with functionally identical depthwise separable convolution layers, by carefully balancing the trade-off among speed, accuracy, and resource usage through resource usage modelling and network fine-tuning. Results show that hardware designs generated by our framework can reach up to 231.7 frames per second for MobileNets, and for VGG-16 [3] we gain a 3.43x speed-up with a 3.54% accuracy decrease on the ImageNet [4] dataset when comparing the original model with a layer-replaced one.
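Since the argument above hinges on how much arithmetic the depthwise/pointwise split saves, a minimal back-of-the-envelope sketch in Python may help; it is not the authors' framework, and the layer dimensions below are hypothetical rather than taken from the paper.

```python
# Minimal sketch (illustrative only): multiply-accumulate cost of a standard convolution
# layer versus its depthwise separable replacement. Layer shapes are hypothetical.

def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a k x k standard convolution on an h x w x c_in input (stride 1, same size)."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a depthwise (k x k, per channel) plus pointwise (1 x 1) pair."""
    depthwise = h * w * c_in * k * k      # spatial correlations, one filter per channel
    pointwise = h * w * c_in * c_out      # cross-channel correlations
    return depthwise + pointwise

if __name__ == "__main__":
    h = w = 56
    c_in, c_out, k = 128, 128, 3          # hypothetical MobileNet-like layer
    std = standard_conv_macs(h, w, c_in, c_out, k)
    sep = depthwise_separable_macs(h, w, c_in, c_out, k)
    print(f"standard: {std:.3e} MACs, separable: {sep:.3e} MACs, ratio: {std / sep:.1f}x")
```

For a 3x3 layer the ratio approaches k^2 = 9x as the channel count grows, which is one reason the replacement frees enough FPGA resources to be attractive.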
{"title":"Automatic Optimising CNN with Depthwise Separable Convolution on FPGA: (Abstact Only)","authors":"Ruizhe Zhao, Xinyu Niu, W. Luk","doi":"10.1145/3174243.3174959","DOIUrl":"https://doi.org/10.1145/3174243.3174959","url":null,"abstract":"Convolution layers in Convolutional Neural Networks (CNNs) are effective in vision feature extraction but quite inefficient in computational resource usage. Depthwise separable convolution layer has been proposed in recent publications to enhance the efficiency without reducing the effectiveness by separately computing the spatial and cross-channel correlations from input images and has proven successful in state-of-the-art networks such as MobileNets [1] and Xception [2]. Based on the facts that depthwise separable convolution is highly structured and uses limited resources, we argue that it can well fit reconfigurable platforms like FPGA. To benefit FPGA platforms with this new layer, in this paper, we present a novel framework that can automatically generate and optimise hardware designs for depthwise separable CNNs. Besides, in our framework, existing conventional CNNs can be systematically converted to ones whose standard convolution layers are selectively replaced with functionally identical depthwise separable convolution layers, by carefully balancing the trade-off among speed, accuracy, and resource usage through resource usage modelling and network fine-tuning. Results show that hardware designs generated by our framework can reach at most 231.7 frames per second regarding MobileNets, and for VGG-16 [3], we gain 3.43 times speed-up and 3.54% accuracy decrease on the ImageNet [4] dataset comparing the original model and a layer replaced one.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127688616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Junzhong Shen, Y. Huang, Zelong Wang, Yuran Qiao, M. Wen, Chunyuan Zhang
Three-dimensional convolutional neural networks (3D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on designing and optimizing accelerators for 2D CNNs, with few attempts made to accelerate 3D CNNs on FPGA. We find accelerating 3D CNNs on FPGA to be challenging due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs using a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to ensure fast development of 2D and 3D CNN accelerators. Furthermore, we also develop a uniform analytical model to facilitate efficient design space exploration for 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy efficiency gain over a GPU solution.
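The templates above are built on the Winograd algorithm; as a rough illustration of why that algorithm reduces multiplication count (the quantity that FPGA DSP usage tracks), here is a minimal 1D Winograd F(2,3) sketch using the standard transform matrices. It is a software illustration only, not the authors' 2D/3D hardware template.

```python
# Winograd F(2,3): produce 2 outputs of a 3-tap filter from a 4-sample input tile
# using 4 multiplications instead of the 6 a direct sliding window needs.
import numpy as np

BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])                # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)    # output transform

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 filter outputs."""
    return AT @ ((G @ g) * (BT @ d))            # element-wise product = the 4 multiplies

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])
direct = np.array([d[0:3] @ g, d[1:4] @ g])     # plain sliding-window reference
assert np.allclose(winograd_f23(d, g), direct)
print(winograd_f23(d, g))
```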
{"title":"Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA","authors":"Junzhong Shen, Y. Huang, Zelong Wang, Yuran Qiao, M. Wen, Chunyuan Zhang","doi":"10.1145/3174243.3174257","DOIUrl":"https://doi.org/10.1145/3174243.3174257","url":null,"abstract":"Three-dimensional convolutional neural networks (3D CNNs) are used efficiently in many computer vision applications. Most previous work in this area has concentrated only on designing and optimizing accelerators for 2D CNN, with few attempts made to accelerate 3D CNN on FPGA. We find accelerating 3D CNNs on FPGA to be challenge due to their high computational complexity and storage demands. More importantly, although the computation patterns of 2D and 3D CNNs are analogous, the conventional approaches adopted for accelerating 2D CNNs may be unfit for 3D CNN acceleration. In this paper, in order to accelerate 2D and 3D CNNs using a uniform framework, we propose a uniform template-based architecture that uses templates based on the Winograd algorithm to ensure fast development of 2D and 3D CNN accelerators. Furthermore, we also develop a uniform analytical model to facilitate efficient design space explorations of 2D and 3D CNN accelerators based on our architecture. Finally, we demonstrate the effectiveness of the template-based architecture by implementing accelerators for real-life 2D and 3D CNNs (VGG16 and C3D) on multiple FPGA platforms. On S2C VUS440, we achieve up to 1.13 TOPS and 1.11 TOPS under low resource utilization for VGG16 and C3D, respectively. End-to-end comparisons with CPU and GPU solutions demonstrate that our implementation of C3D achieves gains of up to 13x and 60x in performance and energy relative to a CPU solution, and a 6.4x energy efficiency gain over a GPU solution.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127727528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 80
Solving Satisfiability Problem on Quantum Annealer: A Lesson from FPGA CAD Tools: (Abstract Only)
J. Su, Lei He
Recently, a practical quantum annealing device has been commercialized by D-Wave Systems, sparking research interest in developing applications that solve problems intractable for classical computers. This paper provides a tutorial on using a quantum annealer to solve the Boolean satisfiability problem. We explain the computational model of the quantum annealer and discuss a detailed mapping technique inspired by the FPGA CAD flow, including stages such as logic optimization, placement, and routing.
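As a generic illustration of how logic constraints become annealer-friendly energy functions (this is the textbook gate-to-QUBO encoding, not the paper's D-Wave placement-and-routing flow), the sketch below encodes a single OR constraint as a penalty whose zero-energy states are exactly the consistent assignments.

```python
# Minimal sketch: a QUBO penalty for the constraint z = (x1 OR x2). Ground states
# (penalty 0) are exactly the assignments that satisfy the constraint; everything
# else is penalized, which is what an annealer minimizes over.
import itertools

def or_gate_penalty(x1, x2, z):
    """Quadratic penalty that is 0 iff z == (x1 OR x2), positive otherwise."""
    return x1 * x2 + x1 + x2 + z - 2 * z * (x1 + x2)

for x1, x2, z in itertools.product([0, 1], repeat=3):
    p = or_gate_penalty(x1, x2, z)
    valid = (z == (x1 | x2))
    assert (p == 0) == valid
    print(f"x1={x1} x2={x2} z={z}  penalty={p}  {'ground state' if valid else 'penalized'}")
```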
{"title":"Solving Satisfiability Problem on Quantum Annealer: A Lesson from FPGA CAD Tools: (Abstract Only)","authors":"J. Su, Lei He","doi":"10.1145/3174243.3174972","DOIUrl":"https://doi.org/10.1145/3174243.3174972","url":null,"abstract":"Recently, a practical quantum annealing device has been commercialized by D-Wave Systems, sparking research interest in developing applications to solve problems that are intractable for classical computer. This paper provides a tutorial for using quantum annealer to solve Boolean satisfiability problem. We explain the computational model of quantum annealer and discuss the detailed mapping technique inspired by FPGA CAD flow, including stages such as logic optimization, placement and routing.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134441833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation
Steve Dai, Gai Liu, Zhiru Zhang
Despite increasing adoption of high-level synthesis (HLS) for its design productivity advantage, success in achieving high quality-of-results out-of-the-box is often hindered by the inexactness of the common HLS optimizations. In particular, while scheduling forms the algorithmic core of HLS technology, current scheduling algorithms rely heavily on fundamentally inexact heuristics that make ad hoc local decisions and cannot accurately and globally optimize over a rich set of constraints. To tackle this challenge, we propose a scheduling formulation based on a system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle a variety of scheduling constraints. We develop a specialized scheduler based on conflict-driven learning and problem-specific knowledge to optimally and efficiently solve the resource-constrained scheduling problem. By leveraging the efficiency of SDC algorithms and scalability of modern SAT solvers, our scheduling technique is able to achieve on average over 100x improvement in runtime over the integer linear programming (ILP) approach while attaining optimal latency. By integrating our scheduling formulation into a state-of-the-art open-source HLS tool, we further demonstrate the applicability of our scheduling technique with a suite of representative benchmarks targeting FPGAs.
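To make the SDC half of the formulation concrete, here is a minimal sketch that checks a system of difference constraints x_u - x_v <= c with Bellman-Ford on the constraint graph, where a negative cycle signals infeasibility. The SAT-based, conflict-driven handling of resource constraints described in the paper is not modeled, and the three-operation example is made up.

```python
# Minimal SDC feasibility sketch: difference constraints map to graph edges, and
# shortest-path distances from a virtual source give one feasible schedule.

def solve_sdc(num_vars, constraints):
    """constraints: list of (u, v, c) encoding x_u - x_v <= c.
    Returns a feasible assignment, or None if the system is infeasible."""
    # Constraint x_u - x_v <= c becomes edge v -> u with weight c; a virtual source
    # (node num_vars) reaches every variable with weight 0.
    edges = [(v, u, c) for (u, v, c) in constraints]
    edges += [(num_vars, i, 0) for i in range(num_vars)]
    dist = [0.0] * (num_vars + 1)
    for _ in range(num_vars):                      # |V| - 1 relaxation rounds
        changed = False
        for src, dst, w in edges:
            if dist[src] + w < dist[dst]:
                dist[dst] = dist[src] + w
                changed = True
        if not changed:
            break
    if any(dist[src] + w < dist[dst] for src, dst, w in edges):
        return None                                # still relaxable -> negative cycle
    return dist[:num_vars]

# Hypothetical dependences: op1 starts >= 2 cycles after op0, op2 starts >= 1 cycle after op1.
print(solve_sdc(3, [(0, 1, -2), (1, 2, -1)]))      # e.g. [-3.0, -1.0, 0.0]; shift so op0 starts at 0
```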
{"title":"A Scalable Approach to Exact Resource-Constrained Scheduling Based on a Joint SDC and SAT Formulation","authors":"Steve Dai, Gai Liu, Zhiru Zhang","doi":"10.1145/3174243.3174268","DOIUrl":"https://doi.org/10.1145/3174243.3174268","url":null,"abstract":"Despite increasing adoption of high-level synthesis (HLS) for its design productivity advantage, success in achieving high quality-of-results out-of-the-box is often hindered by the inexactness of the common HLS optimizations. In particular, while scheduling forms the algorithmic core to HLS technology, current scheduling algorithms rely heavily on fundamentally inexact heuristics that make ad hoc local decisions and cannot accurately and globally optimize over a rich set of constraints. To tackle this challenge, we propose a scheduling formulation based on system of integer difference constraints (SDC) and Boolean satisfiability (SAT) to exactly handle a variety of scheduling constraints. We develop a specialized scheduler based on conflict-driven learning and problem-specific knowledge to optimally and efficiently solve the resource-constrained scheduling problem. By leveraging the efficiency of SDC algorithms and scalability of modern SAT solvers, our scheduling technique is able to achieve on average over 100x improvement in runtime over the integer linear programming (ILP) approach while attaining optimal latency. By integrating our scheduling formulation into a state-of-the-art open-source HLS tool, we further demonstrate the applicability of our scheduling technique with a suite of representative benchmarks targeting FPGAs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132925878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort (pHS5) (Abstract Only)
Mikhail Asiatici, Damian Maiorano, P. Ienne
String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging, and it is no surprise that no string sorters on FPGA have been proposed yet. We present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared-memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28-thread software baseline running on the 14-core Xeon processor, and by up to 36% at lower thread counts.
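As a rough illustration of the sample-sort skeleton that pS5/pHS5 parallelize, the sketch below samples splitters, classifies strings into buckets, and sorts the buckets independently; in the paper, those buckets are the work items the CPU threads and FPGA PEs compete for. The bucket count, sample size, and word list are arbitrary choices, and none of the cache-, SIMD-, or FPGA-level detail of the real algorithm appears here.

```python
# Minimal, purely sequential string sample sort sketch (not pS5/pHS5 itself).
import bisect
import random

def string_sample_sort(strings, num_buckets=4, small=8):
    if len(strings) <= small:
        return sorted(strings)                       # small inputs: plain comparison sort
    sample = sorted(random.sample(strings, min(len(strings), 4 * num_buckets)))
    step = max(1, len(sample) // num_buckets)
    splitters = sample[step::step][:num_buckets - 1] # evenly spaced sample elements
    buckets = [[] for _ in range(len(splitters) + 1)]
    for s in strings:
        buckets[bisect.bisect_right(splitters, s)].append(s)   # classify against splitters
    if max(len(b) for b in buckets) == len(strings):
        return sorted(strings)                       # degenerate split (e.g. many duplicates)
    out = []
    for b in buckets:                                # each bucket could be sorted in parallel
        out.extend(string_sample_sort(b, num_buckets, small))
    return out

words = ["fpga", "cpu", "gpu", "sort", "sample", "string", "harp", "xeon", "kernel", "parser"]
assert string_sample_sort(words) == sorted(words)
print(string_sample_sort(words))
```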
{"title":"FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort (pHS5)(Abstract Only)","authors":"Mikhail Asiatici, Damian Maiorano, P. Ienne","doi":"10.1145/3174243.3174993","DOIUrl":"https://doi.org/10.1145/3174243.3174993","url":null,"abstract":"String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging and it is no surprise that no string sorters on FPGA have been proposed yet. We present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28 thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127114012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL
H. Zohouri, Artur Podobas, S. Matsuoka
Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieves this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using the Intel FPGA SDK for OpenCL that achieves high performance without such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arising from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.
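To show what combined spatial and temporal blocking buys in the simplest possible setting, here is a 1D software sketch: each spatial tile carries a halo of width T so that T time steps can be fused per tile, at the cost of some redundant halo work. The paper's accelerator does this for 2D/3D stencils in OpenCL on FPGA; the stencil, tile size, and array length below are arbitrary.

```python
# Minimal 1D sketch of temporal blocking on top of spatial tiles (illustrative only).
import numpy as np

def step(a):
    """One 3-point Jacobi-style smoothing step with fixed boundary values."""
    out = a.copy()
    out[1:-1] = (a[:-2] + a[1:-1] + a[2:]) / 3.0
    return out

def naive(a, T):
    for _ in range(T):
        a = step(a)
    return a

def blocked(a, T, tile=16):
    """Spatial tiles of `tile` outputs, each advanced T time steps locally."""
    n, out = len(a), a.copy()
    for lo in range(0, n, tile):
        hi = min(n, lo + tile)
        s, e = max(0, lo - T), min(n, hi + T)     # halo of width T on each side
        local = a[s:e].copy()
        for _ in range(T):
            local = step(local)                   # stale halo cells shrink by 1 per step
        out[lo:hi] = local[lo - s:hi - s]         # the tile interior is exact after T steps
    return out

a = np.random.rand(100)
T = 4
assert np.allclose(naive(a, T), blocked(a, T))
print("blocked result matches the naive time loop")
```

The redundant halo computation is the price paid for reading each input tile from external memory only once per T time steps, which is the trade-off the accelerator exploits.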
{"title":"Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL","authors":"H. Zohouri, Artur Podobas, S. Matsuoka","doi":"10.1145/3174243.3174248","DOIUrl":"https://doi.org/10.1145/3174243.3174248","url":null,"abstract":"Recent developments in High Level Synthesis tools have attracted software programmers to accelerate their high-performance computing applications on FPGAs. Even though it has been shown that FPGAs can compete with GPUs in terms of performance for stencil computation, most previous work achieve this by avoiding spatial blocking and restricting input dimensions relative to FPGA on-chip memory. In this work we create a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance without having such restrictions. We combine spatial and temporal blocking to avoid input size restrictions, and employ multiple FPGA-specific optimizations to tackle issues arisen from the added design complexity. Accelerator parameter tuning is guided by our performance model, which we also use to project performance for the upcoming Intel Stratix 10 devices. On an Arria 10 GX 1150 device, our accelerator can reach up to 760 and 375 GFLOP/s of compute performance, for 2D and 3D stencils, respectively, which rivals the performance of a highly-optimized GPU implementation. Furthermore, we estimate that the upcoming Stratix 10 devices can achieve a performance of up to 3.5 TFLOP/s and 1.6 TFLOP/s for 2D and 3D stencil computation, respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122156146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 80
P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs
Jeferson Santiago da Silva, F. Boyer, J. Langlois
Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds to a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves a 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art.
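As a plain-software analogue of the parser-graph idea (state-by-state header extraction, with the EtherType driving the transition to the next state), here is a minimal Ethernet-then-IPv4 parser; it is not the generated pipelined C++/RTL, and the sample packet bytes are made up for the demonstration.

```python
# Minimal parser-graph sketch: Ethernet state, then an IPv4 state selected by EtherType.
import struct

def parse(packet: bytes):
    headers = {}
    # Ethernet: 6B dst MAC, 6B src MAC, 2B EtherType
    dst, src, ethertype = struct.unpack("!6s6sH", packet[:14])
    headers["ethernet"] = {"dst": dst.hex(), "src": src.hex(), "type": hex(ethertype)}
    if ethertype == 0x0800:                        # transition: EtherType selects next state
        ver_ihl, _, total_len, _, _, _, proto, _, sip, dip = struct.unpack(
            "!BBHHHBBH4s4s", packet[14:34])        # fixed 20-byte IPv4 header
        headers["ipv4"] = {
            "version": ver_ihl >> 4,
            "ihl": ver_ihl & 0xF,
            "total_length": total_len,
            "protocol": proto,
            "src": ".".join(str(b) for b in sip),
            "dst": ".".join(str(b) for b in dip),
        }
    return headers

# A hypothetical Ethernet + IPv4 header (payload omitted) for demonstration.
pkt = bytes.fromhex("ffffffffffff" "001122334455" "0800"
                    "4500002a0001000040060000" "c0a80001" "c0a80002")
print(parse(pkt))
```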
{"title":"P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs","authors":"Jeferson Santiago da Silva, F. Boyer, J. Langlois","doi":"10.1145/3174243.3174270","DOIUrl":"https://doi.org/10.1145/3174243.3174270","url":null,"abstract":"Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated textttC++ classes. The pipeline layout is derived from a parser graph that corresponds to a P4 code after a series of graph transformation rounds. The RTL code is generated from the textttC++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves a SI100 gigabit/second data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124098410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
{"title":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","authors":"","doi":"10.1145/3174243","DOIUrl":"https://doi.org/10.1145/3174243","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132891872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0