Automatic Generation of Hardware Sandboxes for Trojan Mitigation in Systems on Chip (Abstract Only). DOI: 10.1145/3020078.3021774
C. Bobda, Taylor J. L. Whitaker, C. Kamhoua, K. Kwiat, L. Njilla
Component-based design is one of the preferred methods for tackling system complexity and reducing costs and time-to-market. Major parts of system design and IC production are outsourced to facilities distributed across the globe, opening the door for malicious Trojan insertion. Hardware sandboxing was introduced to overcome the shortcomings of traditional static Trojan mitigation methods, which rely on intensive simulation, verification, and physical testing to detect evidence of malicious components before system deployment. The number of test patterns needed to activate potential hidden Trojans with certainty is very large for complex IPs and SoCs with dozens of inputs, outputs, states, and memory blocks, which limits the effectiveness of static testing. The rationale is to spend less effort on pre-deployment testing; instead, guards are built around non-trusted components to catch malicious activity and prevent potential damage. While the feasibility of hardware sandboxes has been demonstrated with case studies and real-world applications, the designs were produced manually, and no systematic method existed to automate the design of systems-on-chip that incorporate hardware sandboxes to provide a high level of security in embedded systems. In this work, we propose a method for the automatic generation of hardware sandboxes in systems-on-chip. Using the interface formalism of de Alfaro and Henzinger to capture the interactions among components, along with a property specification language to define non-authorized actions, sandboxes are generated and made ready for inclusion in a system-on-chip design. We leverage the concepts of composition, compatibility, and refinement to optimize resources across component boundaries and minimize resource consumption. With results on benchmarks implemented on FPGAs, we show that our approach provides a high level of security with low resource overhead and no increase in delay.
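To make the sandboxing idea concrete, the guard inside a sandbox can be viewed as a monitor automaton derived from the interface specification: it tracks the I/O actions of the untrusted component and traps any action the interface does not permit. The C sketch below illustrates this under simplified, assumed conditions; the two-state handshake interface, state names, and action set are illustrative, not the output of the paper's generator.

```c
#include <stdio.h>

/* Hypothetical interface-automaton monitor: states and actions are
 * illustrative stand-ins for what a generated sandbox would encode. */
enum state  { IDLE, BUSY, TRAP };
enum action { REQ, ACK, SPURIOUS_WRITE };

/* Transition function: returns the next state, or TRAP when the
 * observed action is not permitted in the current state. */
static enum state step(enum state s, enum action a) {
    switch (s) {
    case IDLE: return (a == REQ) ? BUSY : TRAP;
    case BUSY: return (a == ACK) ? IDLE : TRAP;
    default:   return TRAP;
    }
}

int main(void) {
    enum action trace[] = { REQ, ACK, REQ, SPURIOUS_WRITE };
    enum state s = IDLE;
    for (int i = 0; i < 4; i++) {
        s = step(s, trace[i]);
        if (s == TRAP) {   /* the sandbox would block the signal here */
            printf("illegal action at step %d: quarantined\n", i);
            break;
        }
    }
    return 0;
}
```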
Session details: Graph Processing Applications. Nachiket Kapre. DOI: 10.1145/3257190
RxRE: Throughput Optimization for High-Level Synthesis using Resource-Aware Regularity Extraction (Abstract Only). DOI: 10.1145/3020078.3021797
Despite considerable improvements in the quality of HLS tools, they still require manual optimizations and tweaks from the designer to generate efficient results, which negates the productivity gains of HLS design. The majority of designer interventions lead to optimizations that are global in nature, for instance, finding patterns in functions that better fit a custom-designed solution. We introduce a high-level resource-aware regularity extraction workflow, called RxRE, that detects a class of patterns in an input program and enhances resource sharing to balance resource usage against increased throughput. RxRE automatically detects structural patterns, i.e., repeated sequences of floating-point operations, in sequential loops, selects suitable resources for them, and shares those resources among all instances of the selected patterns. RxRE reduces the hardware area required to synthesize an instance of the program, so more program replicas can fit into the fixed area budget of an FPGA. RxRE is a pre-synthesis workflow that exploits the inherent regularity of applications to achieve higher computational throughput using off-the-shelf HLS tools, without any changes to the HLS flow. It uses a string-based pattern detection approach to find linear patterns across loops within the same function, and it deploys a simple but effective model to estimate the resource utilization and latency of each candidate design, avoiding synthesis of every possible design alternative. We have implemented and evaluated RxRE using a set of C benchmarks. Synthesis results on a Xilinx Virtex FPGA show that the reduced area of the transformed programs improves the number of mapped kernels by a factor of 1.54X on average (maximum 2.8X), which yields on average 1.59X (maximum 2.4X) higher throughput than the Xilinx Vivado HLS solution. The current implementation has several limitations and only extracts a special case of regularity, which is the subject of ongoing optimization and study.
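As a rough illustration of string-based pattern detection, each floating-point operation can be encoded as a character, so repeated substrings correspond to repeated operation sequences that are candidates for sharing one datapath. The sketch below uses an assumed encoding and a fixed window length K; RxRE's actual detection is more elaborate.

```c
#include <stdio.h>
#include <string.h>

/* Each floating-point operation is encoded as one character
 * ('a' = add, 'm' = mul, 'd' = div); the encoding and the fixed
 * window length K are illustrative assumptions. */
#define K 3

int main(void) {
    const char *ops = "amdamdamdxam";  /* hypothetical op sequence */
    int n = (int)strlen(ops), best = 0;
    char best_pat[K + 1] = "";
    /* Count occurrences of every K-gram; the most frequent one is the
     * best candidate for a shared hardware resource. */
    for (int i = 0; i + K <= n; i++) {
        int count = 0;
        for (int j = 0; j + K <= n; j++)
            if (strncmp(ops + i, ops + j, K) == 0) count++;
        if (count > best) {
            best = count;
            strncpy(best_pat, ops + i, K);
            best_pat[K] = '\0';
        }
    }
    printf("pattern '%s' occurs %d times -> share one datapath\n",
           best_pat, best);
    return 0;
}
```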
Joint Modulo Scheduling and Memory Partitioning with Multi-Bank Memory for High-Level Synthesis (Abstract Only). DOI: 10.1145/3020078.3021778
High-Level Synthesis (HLS) has been widely recognized and accepted as an efficient compilation process targeting FPGAs for algorithm evaluation and product prototyping. However, massively parallel memory access demands and the extremely high cost of single-bank, multi-port memory have impeded loop pipelining performance. Based on an alternative multi-bank memory architecture, we propose a joint approach that employs memory-aware force-directed scheduling and multi-cycle memory partitioning to achieve a legitimate pipelined kernel and a valid bank mapping with lower resource consumption and optimal pipelining performance. Experimental results over a variety of benchmarks show that our approach achieves optimal pipelining performance while reducing the number of independent memory banks by 55.1% on average, compared with state-of-the-art approaches.
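A common concrete form of multi-bank partitioning is cyclic banking, where element i maps to bank i mod N at intra-bank offset i / N, and a pipelined iteration is legal when its parallel accesses land in distinct banks. The following sketch checks that condition for a hypothetical three-access kernel; the bank count and access pattern are illustrative, and the paper's joint scheduling and partitioning is considerably more general.

```c
#include <stdio.h>

#define NBANKS 4  /* illustrative bank count */

/* Cyclic partitioning: element i lives in bank (i % NBANKS) at
 * intra-bank offset (i / NBANKS). */
static int bank(int i)   { return i % NBANKS; }
static int offset(int i) { return i / NBANKS; }

int main(void) {
    /* Hypothetical pipelined iteration touching a[i], a[i+1], a[i+2]:
     * verify that each iteration's parallel accesses hit distinct banks. */
    for (int i = 0; i < 8; i++) {
        int b0 = bank(i), b1 = bank(i + 1), b2 = bank(i + 2);
        int conflict = (b0 == b1) || (b0 == b2) || (b1 == b2);
        printf("i=%d -> banks {%d,%d,%d} offsets {%d,%d,%d}%s\n",
               i, b0, b1, b2, offset(i), offset(i + 1), offset(i + 2),
               conflict ? " CONFLICT" : "");
    }
    return 0;
}
```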
A New Approach to Automatic Memory Banking using Trace-Based Address Mining. DOI: 10.1145/3020078.3021734
Recent years have seen an increased deployment of FPGAs as programmable accelerators for improving the performance and energy efficiency of compute-intensive applications. A well-known "secret sauce" of achieving highly efficient FPGA acceleration is to create an application-specific memory architecture that fully exploits the vast on-chip memory bandwidth provided by the reconfigurable fabric. In particular, memory banking is widely employed when multiple parallel memory accesses are needed to meet a demanding throughput constraint. In this paper we propose TraceBanking, a novel and flexible trace-driven address mining algorithm that automatically generates efficient memory banking schemes by analyzing a stream of memory address bits. Unlike mainstream memory partitioning techniques based on static compile-time analysis, TraceBanking relies only on simple source-level instrumentation to provide the memory trace of interest, without enforcing any coding restrictions. More importantly, our technique can effectively handle memory traces that exhibit either affine or non-affine access patterns, and it produces efficient banking solutions with reasonable runtime. Furthermore, TraceBanking can process a reduced memory trace with the aid of an SMT prover that verifies whether the resulting banking scheme is indeed conflict-free. Our experiments on Xilinx FPGAs show that TraceBanking achieves competitive performance and resource usage compared to the state-of-the-art across a set of real-life benchmarks with affine memory accesses. We also perform a case study on a face detection algorithm to show that TraceBanking is capable of generating a highly area-efficient memory partitioning from a sequence of addresses without any obvious access pattern.
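The core of a trace-driven banking check can be sketched as follows: pick a candidate subset of address bits, form each address's bank index from those bits, and accept the subset only if no two same-cycle accesses in the trace collide. The C sketch below uses a hypothetical three-cycle trace and exhaustively tries 3-bit masks; TraceBanking's actual mining and SMT-backed verification are far more sophisticated.

```c
#include <stdio.h>

/* Extract the bank index by gathering the address bits selected by
 * 'mask' (a simplified stand-in for a mined bit subset). */
static unsigned bank_of(unsigned addr, unsigned mask) {
    unsigned b = 0, bit = 0;
    for (unsigned m = mask; m; m &= m - 1) {
        unsigned lowest = m & -m;       /* isolate the lowest set bit */
        if (addr & lowest) b |= 1u << bit;
        bit++;
    }
    return b;
}

int main(void) {
    /* Hypothetical trace: each row = addresses accessed in one cycle. */
    unsigned trace[3][2] = { {0, 1}, {2, 3}, {5, 6} };
    for (unsigned mask = 1; mask < 8; mask++) {  /* try all 3-bit masks */
        int ok = 1;
        for (int c = 0; c < 3 && ok; c++)
            if (bank_of(trace[c][0], mask) == bank_of(trace[c][1], mask))
                ok = 0;
        printf("mask 0x%x: %s\n", mask, ok ? "conflict-free" : "conflict");
    }
    return 0;
}
```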
A Framework for Iterative Stencil Algorithm Synthesis on FPGAs from OpenCL Programming Model (Abstract Only). DOI: 10.1145/3020078.3021761
Iterative stencil algorithms find applications in a wide range of domains. FPGAs have long been adopted for computation acceleration due to the advantages of dedicated hardware design, making them a compelling alternative for executing iterative stencil algorithms. However, efficient implementation of iterative stencil algorithms on FPGAs is very challenging due to the data dependencies between iterations and elements, the programming hurdle of FPGAs, and the large design space. In this paper, we present a comprehensive framework that efficiently synthesizes iterative stencil algorithms on FPGAs. We leverage the OpenCL-to-FPGA tool chain to generate accelerators automatically and perform design space exploration at a high level. We propose to bridge neighboring tiles through pipes, enabling data sharing among them to improve computation efficiency. We first propose a homogeneous design with equal tile sizes, then extend it to a heterogeneous design with different tile sizes to balance the computation among tiles. Our designs exhibit a large design space in terms of tile structure, so we also develop analytical performance models to explore this complex design space. Experiments using a wide range of stencil applications demonstrate that, on average, our homogeneous and heterogeneous implementations achieve 1.49X and 1.65X performance speedups respectively, while using fewer hardware resources than the state-of-the-art.
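The value of heterogeneous tile sizes can be seen with a toy analytical model: if the pipeline of tiles is bound by its slowest tile and tiles carry different fixed costs (e.g., external memory access versus pipe transfers), unequal widths can rebalance the load. The constants and cost model below are illustrative assumptions, not the paper's model.

```c
#include <stdio.h>

/* Toy model: tile i needs width[i] + ovh[i] cycles per row; the
 * pipeline is bound by the slowest tile, so unequal widths can
 * rebalance tiles whose fixed costs differ. */
static int bound(const int *width, const int *ovh, int n) {
    int worst = 0;
    for (int i = 0; i < n; i++)
        if (width[i] + ovh[i] > worst) worst = width[i] + ovh[i];
    return worst;
}

int main(void) {
    int ovh[2]    = { 0, 4 };   /* hypothetical per-tile fixed cost    */
    int homo[2]   = { 8, 8 };   /* homogeneous: equal tile widths      */
    int hetero[2] = { 10, 6 };  /* heterogeneous: load-balanced widths */
    printf("homogeneous bound:   %d cycles/row\n", bound(homo, ovh, 2));
    printf("heterogeneous bound: %d cycles/row\n", bound(hetero, ovh, 2));
    return 0;
}
```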
Accelerating Financial Market Server through Hybrid List Design (Abstract Only). DOI: 10.1145/3020078.3021775
H. Fu, Conghui He, Huabin Ruan, Itay Greenspon, W. Luk, Yongkang Zheng, Junfeng Liao, Qing Zhang, Guangwen Yang
The financial market server in an exchange maintains the order books and provides real-time market data feeds to traders, and low-latency processing is in great demand in financial trading. Although software solutions provide the flexibility to express algorithms in high-level programming models and to recompile quickly, they are becoming increasingly uncompetitive due to long and unpredictable response times. Field Programmable Gate Arrays (FPGAs) have proven to be an established technology for achieving low and constant latency when processing streaming packets in hardware. However, maintaining order books on FPGAs involves organizing packets into gigabytes of structured data as well as complicated routines (sort, insertion, deletion, etc.), which is extremely challenging for FPGA designs in both design methodology and memory volume. Existing FPGA designs therefore often leave the post-processing to the CPU, which largely cancels the latency gain of the network packet processing. This paper proposes a CPU-FPGA hybrid list design that accelerates financial market servers to microsecond-level latencies. The paper makes four main contributions. First, we design a two-level CPU-FPGA hybrid list: a small cache list on the FPGA and a large master list on the CPU host. The two lists use different sorting schemes: bitonic sort is applied to the cache list, while a balanced tree maintains the master list. Second, to update the hybrid sorted list effectively, we derive a complete set of low-latency routines, including insertion, deletion, selection, and sorting, each costing only a few cycles. Third, we propose a non-blocking, on-demand synchronization strategy for communication between the cache list and the master list. Lastly, we integrate the hybrid list with other components, such as packet splitting, parsing, and processing, to form an industry-level financial market server. Our design is deployed in the environment of the China Financial Futures Exchange (CFFEX), where it has demonstrated its functionality and stability by running for over 600 hours while handling hundreds of millions of packets per day. Compared with the existing CPU-based solution at CFFEX, our system supports identical functionality while reducing the latency from over 100 microseconds to 2 microseconds, a 50x speedup.
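The cache list's bitonic sort is attractive in hardware because its compare-exchange pattern is fixed and data-independent, so it maps directly to a sorting network. A standard iterative bitonic sort for a power-of-two list is sketched below; the list size and contents are illustrative.

```c
#include <stdio.h>

#define N 8  /* cache-list capacity; power of two for the bitonic network */

static void cmp_swap(int *a, int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) {      /* dir=1: ascending order */
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Iterative bitonic sorting network: the fixed compare-exchange
 * schedule is what makes it attractive for an FPGA cache list. */
static void bitonic_sort(int *a) {
    for (int k = 2; k <= N; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            for (int i = 0; i < N; i++) {
                int ixj = i ^ j;
                if (ixj > i)
                    cmp_swap(a, i, ixj, (i & k) == 0);
            }
}

int main(void) {
    int prices[N] = { 42, 7, 19, 3, 88, 55, 21, 64 };  /* toy order prices */
    bitonic_sort(prices);
    for (int i = 0; i < N; i++) printf("%d ", prices[i]);
    printf("\n");
    return 0;
}
```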
Session details: Panel: FPGAs in the Cloud. G. Constantinides. DOI: 10.1145/3257188
120-core microAptiv MIPS Overlay for the Terasic DE5-NET FPGA board. DOI: 10.1145/3020078.3021751
B. ChethanKumarH., P. Ravi, G. Modi, Nachiket Kapre
We design a 120-core, 94MHz MIPS processor FPGA overlay, interconnected with a lightweight message-passing fabric, that fits on a Stratix V GX FPGA (5SGXEA7N2F45C2). We use silicon-tested RTL source code for the microAptiv MIPS processor, made available under the Imagination Technologies Academic Program. We augment the processor with custom instruction extensions for moving data between cores via explicit message passing, and we support these instructions with a communication scratchpad optimized for high-throughput injection of network traffic. We also demonstrate an end-to-end proof-of-concept flow that compiles C code containing suitable MIPS UDI (user-defined instruction) message-passing workloads, and we stress-test the overlay with synthetic workloads.
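On the real overlay, the message-passing UDIs would be emitted as custom opcodes (e.g., via inline assembly); a plain-C model of the intended send/receive semantics might look like the sketch below, where the function names, queue depth, and single shared FIFO are illustrative assumptions rather than the overlay's actual interface.

```c
#include <stdio.h>

/* Software model of the message-passing UDIs: names, queue depth,
 * and semantics here are illustrative assumptions. */
#define DEPTH 16
static int fifo[DEPTH], head, tail;

static void mp_send(int dest_core, int word) {
    (void)dest_core;                  /* routing handled by the fabric */
    fifo[tail++ % DEPTH] = word;      /* inject into the scratchpad    */
}

static int mp_recv(void) {
    return fifo[head++ % DEPTH];      /* drain from the scratchpad     */
}

int main(void) {
    for (int i = 0; i < 4; i++) mp_send(1, i * i);
    for (int i = 0; i < 4; i++) printf("core 1 got %d\n", mp_recv());
    return 0;
}
```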
Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming HPC Applications (Abstract Only). DOI: 10.1145/3020078.3021767
Streaming HPC applications are data-intensive and see widespread use in various fields (e.g., computational fluid dynamics and bioinformatics). These applications consist of different processing kernels, each performing a specific computation on its input data, and the objective of the optimization process is to maximize performance. FPGAs show great promise for accelerating streaming applications because of their low power consumption combined with high theoretical compute capabilities. However, mapping an HPC application to a reconfigurable fabric is a challenging task, exacerbated by the need to temporally partition computational kernels when application requirements exceed resource availability. In this poster, we present work towards a novel design methodology for exploring the design space of streaming HPC applications on FPGAs. We assume the designer can represent the target application as a Synchronous Data Flow Graph (SDFG), in which nodes are compute kernels and edges signify data flow between kernels. The designer also specifies the problem size of the application and the volume of raw data at each memory source of the SDFG. The output of our method is a set of FPGA configurations, each containing one or more SDFG nodes. The methodology consists of three main steps. In Step 1, we enumerate the valid partitions and the base configurations. In Step 2, we find the feasible base configurations given the available hardware resources and a library of processing kernel implementations. Finally, in Step 3, we use a performance model to calculate the execution time of each partition. Our current assumption is that it is advantageous to represent the SDFG at a coarse granularity, since this enables exhaustive exploration of the design space for practical applications. This approach has yielded promising preliminary results; in one case, the temporal configuration selected by our methodology outperformed the direct mapping by 3X.
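Steps 1 and 3 can be illustrated with a chain-shaped SDFG: each way of cutting the chain yields a temporal partition, and a simple model charges a reconfiguration cost per configuration plus the time of the slowest kernel in each configuration (assuming kernels within one configuration pipeline their data). The kernel times and the reconfiguration constant below are hypothetical.

```c
#include <stdio.h>

#define NKERN  3
#define RECONF 50  /* illustrative reconfiguration cost */

int main(void) {
    /* Hypothetical per-kernel streaming times for a 3-node SDFG chain. */
    int t[NKERN] = { 100, 40, 70 };
    /* Each bit of 'cuts' splits the chain after kernel i; kernels that
     * share a configuration pipeline, so the slowest one dominates. */
    for (unsigned cuts = 0; cuts < (1u << (NKERN - 1)); cuts++) {
        int total = RECONF, worst = 0;
        for (int i = 0; i < NKERN; i++) {
            if (t[i] > worst) worst = t[i];
            if (i == NKERN - 1 || (cuts & (1u << i))) {
                total += worst;                     /* close this config */
                if (i < NKERN - 1) total += RECONF; /* load the next one */
                worst = 0;
            }
        }
        printf("cuts=0x%x -> estimated time %d\n", cuts, total);
    }
    return 0;
}
```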