Automatic Generation of Hardware Sandboxes for Trojan Mitigation in Systems on Chip (Abstract Only). DOI: 10.1145/3020078.3021774
C. Bobda, Taylor J. L. Whitaker, C. Kamhoua, K. Kwiat, L. Njilla
Component-based design is one of the preferred methods for tackling system complexity and reducing costs and time-to-market. Major parts of system design and IC production are outsourced to facilities distributed across the globe, opening the door for malicious Trojan insertion. Hardware sandboxing was introduced to overcome the shortcomings of traditional static Trojan mitigation methods, which rely on intensive simulation, verification, and physical testing to detect evidence of malicious components before system deployment. The number of test patterns needed to activate potential hidden Trojans with certainty is very large for complex IPs and SoCs with dozens of inputs, outputs, states, and memory blocks, which limits the effectiveness of static testing. The rationale is to spend less effort on pre-deployment testing; instead, guards are built around non-trusted components to catch malicious activity and prevent potential damage. While the feasibility of hardware sandboxes has been demonstrated with case studies and real-world applications, the designs were produced manually, and no systematic method existed to automate the design of systems-on-chip that incorporate hardware sandboxes to provide a high level of security in embedded systems. In this work, we propose a method for the automatic generation of hardware sandboxes in systems-on-chip. Using the interface formalism of de Alfaro and Henzinger to capture the interactions among components, along with a property specification language to define non-authorized actions, sandboxes are generated and made ready for inclusion in a system-on-chip design. We leverage the concepts of composition, compatibility, and refinement to optimize resources across component boundaries and minimize resource consumption. With results on benchmarks implemented on FPGAs, we show that our approach provides a high level of security with low resource overhead and no increase in delay.
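To make the sandboxing idea concrete, the guard inside a sandbox can be viewed as a monitor automaton derived from the interface specification: it tracks the I/O actions of the untrusted component and traps any action the interface does not permit. The C sketch below illustrates this under simplified, assumed conditions; the two-state handshake interface, state names, and action set are illustrative, not the output of the paper's generator.

```c
#include <stdio.h>

/* Hypothetical interface-automaton monitor: states and actions are
 * illustrative stand-ins for what a generated sandbox would encode. */
enum state  { IDLE, BUSY, TRAP };
enum action { REQ, ACK, SPURIOUS_WRITE };

/* Transition function: returns the next state, or TRAP when the
 * observed action is not permitted in the current state. */
static enum state step(enum state s, enum action a) {
    switch (s) {
    case IDLE: return (a == REQ) ? BUSY : TRAP;
    case BUSY: return (a == ACK) ? IDLE : TRAP;
    default:   return TRAP;
    }
}

int main(void) {
    enum action trace[] = { REQ, ACK, REQ, SPURIOUS_WRITE };
    enum state s = IDLE;
    for (int i = 0; i < 4; i++) {
        s = step(s, trace[i]);
        if (s == TRAP) {   /* the sandbox would block the signal here */
            printf("illegal action at step %d: quarantined\n", i);
            break;
        }
    }
    return 0;
}
```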
Session details: Graph Processing Applications. Nachiket Kapre. DOI: 10.1145/3257190
RxRE: Throughput Optimization for High-Level Synthesis using Resource-Aware Regularity Extraction (Abstract Only). DOI: 10.1145/3020078.3021797
Despite considerable improvements in the quality of HLS tools, they still require manual optimizations and tweaks from the designer to generate efficient results, which negates the productivity gains of HLS design. The majority of designer interventions lead to optimizations that are global in nature, for instance, finding patterns in functions that better fit a custom-designed solution. We introduce a high-level resource-aware regularity extraction workflow, called RxRE, that detects a class of patterns in an input program and enhances resource sharing to balance resource usage against increased throughput. RxRE automatically detects structural patterns, i.e., repeated sequences of floating-point operations, in sequential loops, selects suitable resources for them, and shares those resources among all instances of the selected patterns. RxRE reduces the hardware area required to synthesize an instance of the program, so more program replicas can fit into the fixed area budget of an FPGA. RxRE is a pre-synthesis workflow that exploits the inherent regularity of applications to achieve higher computational throughput using off-the-shelf HLS tools, without any changes to the HLS flow. It uses a string-based pattern detection approach to find linear patterns across loops within the same function, and it deploys a simple but effective model to estimate the resource utilization and latency of each candidate design, avoiding synthesis of every possible design alternative. We have implemented and evaluated RxRE using a set of C benchmarks. Synthesis results on a Xilinx Virtex FPGA show that the reduced area of the transformed programs improves the number of mapped kernels by a factor of 1.54X on average (maximum 2.8X), which yields on average 1.59X (maximum 2.4X) higher throughput than the Xilinx Vivado HLS solution. The current implementation has several limitations and only extracts a special case of regularity, which is the subject of ongoing optimization and study.
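As a rough illustration of string-based pattern detection, each floating-point operation can be encoded as a character, so repeated substrings correspond to repeated operation sequences that are candidates for sharing one datapath. The sketch below uses an assumed encoding and a fixed window length K; RxRE's actual detection is more elaborate.

```c
#include <stdio.h>
#include <string.h>

/* Each floating-point operation is encoded as one character
 * ('a' = add, 'm' = mul, 'd' = div); the encoding and the fixed
 * window length K are illustrative assumptions. */
#define K 3

int main(void) {
    const char *ops = "amdamdamdxam";  /* hypothetical op sequence */
    int n = (int)strlen(ops), best = 0;
    char best_pat[K + 1] = "";
    /* Count occurrences of every K-gram; the most frequent one is the
     * best candidate for a shared hardware resource. */
    for (int i = 0; i + K <= n; i++) {
        int count = 0;
        for (int j = 0; j + K <= n; j++)
            if (strncmp(ops + i, ops + j, K) == 0) count++;
        if (count > best) {
            best = count;
            strncpy(best_pat, ops + i, K);
            best_pat[K] = '\0';
        }
    }
    printf("pattern '%s' occurs %d times -> share one datapath\n",
           best_pat, best);
    return 0;
}
```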
Joint Modulo Scheduling and Memory Partitioning with Multi-Bank Memory for High-Level Synthesis (Abstract Only). DOI: 10.1145/3020078.3021778
High-Level Synthesis (HLS) has been widely recognized and accepted as an efficient compilation process targeting FPGAs for algorithm evaluation and product prototyping. However, massively parallel memory access demands and the extremely high cost of single-bank, multi-port memory have impeded loop pipelining performance. Based on an alternative multi-bank memory architecture, we propose a joint approach that employs memory-aware force-directed scheduling and multi-cycle memory partitioning to achieve a legitimate pipelined kernel and a valid bank mapping with lower resource consumption and optimal pipelining performance. Experimental results over a variety of benchmarks show that our approach achieves optimal pipelining performance while reducing the number of independent memory banks by 55.1% on average, compared with state-of-the-art approaches.
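A common concrete form of multi-bank partitioning is cyclic banking, where element i maps to bank i mod N at intra-bank offset i / N, and a pipelined iteration is legal when its parallel accesses land in distinct banks. The following sketch checks that condition for a hypothetical three-access kernel; the bank count and access pattern are illustrative, and the paper's joint scheduling and partitioning is considerably more general.

```c
#include <stdio.h>

#define NBANKS 4  /* illustrative bank count */

/* Cyclic partitioning: element i lives in bank (i % NBANKS) at
 * intra-bank offset (i / NBANKS). */
static int bank(int i)   { return i % NBANKS; }
static int offset(int i) { return i / NBANKS; }

int main(void) {
    /* Hypothetical pipelined iteration touching a[i], a[i+1], a[i+2]:
     * verify that each iteration's parallel accesses hit distinct banks. */
    for (int i = 0; i < 8; i++) {
        int b0 = bank(i), b1 = bank(i + 1), b2 = bank(i + 2);
        int conflict = (b0 == b1) || (b0 == b2) || (b1 == b2);
        printf("i=%d -> banks {%d,%d,%d} offsets {%d,%d,%d}%s\n",
               i, b0, b1, b2, offset(i), offset(i + 1), offset(i + 2),
               conflict ? " CONFLICT" : "");
    }
    return 0;
}
```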
A New Approach to Automatic Memory Banking using Trace-Based Address Mining. DOI: 10.1145/3020078.3021734
Recent years have seen an increased deployment of FPGAs as programmable accelerators for improving the performance and energy efficiency of compute-intensive applications. A well-known "secret sauce" of achieving highly efficient FPGA acceleration is to create an application-specific memory architecture that fully exploits the vast on-chip memory bandwidth provided by the reconfigurable fabric. In particular, memory banking is widely employed when multiple parallel memory accesses are needed to meet a demanding throughput constraint. In this paper we propose TraceBanking, a novel and flexible trace-driven address mining algorithm that automatically generates efficient memory banking schemes by analyzing a stream of memory address bits. Unlike mainstream memory partitioning techniques based on static compile-time analysis, TraceBanking relies only on simple source-level instrumentation to provide the memory trace of interest, without enforcing any coding restrictions. More importantly, our technique can effectively handle memory traces that exhibit either affine or non-affine access patterns, and it produces efficient banking solutions with reasonable runtime. Furthermore, TraceBanking can process a reduced memory trace with the aid of an SMT prover that verifies whether the resulting banking scheme is indeed conflict-free. Our experiments on Xilinx FPGAs show that TraceBanking achieves competitive performance and resource usage compared to the state-of-the-art across a set of real-life benchmarks with affine memory accesses. We also perform a case study on a face detection algorithm to show that TraceBanking is capable of generating a highly area-efficient memory partitioning from a sequence of addresses without any obvious access pattern.
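The core of a trace-driven banking check can be sketched as follows: pick a candidate subset of address bits, form each address's bank index from those bits, and accept the subset only if no two same-cycle accesses in the trace collide. The C sketch below uses a hypothetical three-cycle trace and exhaustively tries 3-bit masks; TraceBanking's actual mining and SMT-backed verification are far more sophisticated.

```c
#include <stdio.h>

/* Extract the bank index by gathering the address bits selected by
 * 'mask' (a simplified stand-in for a mined bit subset). */
static unsigned bank_of(unsigned addr, unsigned mask) {
    unsigned b = 0, bit = 0;
    for (unsigned m = mask; m; m &= m - 1) {
        unsigned lowest = m & -m;       /* isolate the lowest set bit */
        if (addr & lowest) b |= 1u << bit;
        bit++;
    }
    return b;
}

int main(void) {
    /* Hypothetical trace: each row = addresses accessed in one cycle. */
    unsigned trace[3][2] = { {0, 1}, {2, 3}, {5, 6} };
    for (unsigned mask = 1; mask < 8; mask++) {  /* try all 3-bit masks */
        int ok = 1;
        for (int c = 0; c < 3 && ok; c++)
            if (bank_of(trace[c][0], mask) == bank_of(trace[c][1], mask))
                ok = 0;
        printf("mask 0x%x: %s\n", mask, ok ? "conflict-free" : "conflict");
    }
    return 0;
}
```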
A Framework for Iterative Stencil Algorithm Synthesis on FPGAs from OpenCL Programming Model (Abstract Only). DOI: 10.1145/3020078.3021761
Iterative stencil algorithms find applications in a wide range of domains. FPGAs have long been adopted for computation acceleration due to the advantages of dedicated hardware design, making them a compelling alternative for executing iterative stencil algorithms. However, efficient implementation of iterative stencil algorithms on FPGAs is very challenging due to the data dependencies between iterations and elements, the programming hurdle of FPGAs, and the large design space. In this paper, we present a comprehensive framework that efficiently synthesizes iterative stencil algorithms on FPGAs. We leverage the OpenCL-to-FPGA tool chain to generate accelerators automatically and perform design space exploration at a high level. We propose to bridge neighboring tiles through pipes, enabling data sharing among them to improve computation efficiency. We first propose a homogeneous design with equal tile sizes, then extend it to a heterogeneous design with different tile sizes to balance the computation among tiles. Our designs exhibit a large design space in terms of tile structure, so we also develop analytical performance models to explore this complex design space. Experiments using a wide range of stencil applications demonstrate that, on average, our homogeneous and heterogeneous implementations achieve 1.49X and 1.65X performance speedups respectively, while using fewer hardware resources than the state-of-the-art.
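The value of heterogeneous tile sizes can be seen with a toy analytical model: if the pipeline of tiles is bound by its slowest tile and tiles carry different fixed costs (e.g., external memory access versus pipe transfers), unequal widths can rebalance the load. The constants and cost model below are illustrative assumptions, not the paper's model.

```c
#include <stdio.h>

/* Toy model: tile i needs width[i] + ovh[i] cycles per row; the
 * pipeline is bound by the slowest tile, so unequal widths can
 * rebalance tiles whose fixed costs differ. */
static int bound(const int *width, const int *ovh, int n) {
    int worst = 0;
    for (int i = 0; i < n; i++)
        if (width[i] + ovh[i] > worst) worst = width[i] + ovh[i];
    return worst;
}

int main(void) {
    int ovh[2]    = { 0, 4 };   /* hypothetical per-tile fixed cost    */
    int homo[2]   = { 8, 8 };   /* homogeneous: equal tile widths      */
    int hetero[2] = { 10, 6 };  /* heterogeneous: load-balanced widths */
    printf("homogeneous bound:   %d cycles/row\n", bound(homo, ovh, 2));
    printf("heterogeneous bound: %d cycles/row\n", bound(hetero, ovh, 2));
    return 0;
}
```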
Accelerating Financial Market Server through Hybrid List Design (Abstract Only). DOI: 10.1145/3020078.3021775
H. Fu, Conghui He, Huabin Ruan, Itay Greenspon, W. Luk, Yongkang Zheng, Junfeng Liao, Qing Zhang, Guangwen Yang
The financial market server in an exchange maintains the order books and provides real-time market data feeds to traders, and low-latency processing is in great demand in financial trading. Although software solutions provide the flexibility to express algorithms in high-level programming models and to recompile quickly, they are becoming increasingly uncompetitive due to long and unpredictable response times. Field Programmable Gate Arrays (FPGAs) have proven to be an established technology for achieving low and constant latency when processing streaming packets in hardware. However, maintaining order books on FPGAs involves organizing packets into gigabytes of structured data as well as complicated routines (sort, insertion, deletion, etc.), which is extremely challenging for FPGA designs in both design methodology and memory volume. Existing FPGA designs therefore often leave the post-processing to the CPU, which largely cancels the latency gain of the network packet processing. This paper proposes a CPU-FPGA hybrid list design that accelerates financial market servers to microsecond-level latencies. The paper makes four main contributions. First, we design a two-level CPU-FPGA hybrid list: a small cache list on the FPGA and a large master list on the CPU host. The two lists use different sorting schemes: bitonic sort is applied to the cache list, while a balanced tree maintains the master list. Second, to update the hybrid sorted list effectively, we derive a complete set of low-latency routines, including insertion, deletion, selection, and sorting, each costing only a few cycles. Third, we propose a non-blocking, on-demand synchronization strategy for communication between the cache list and the master list. Lastly, we integrate the hybrid list with other components, such as packet splitting, parsing, and processing, to form an industry-level financial market server. Our design is deployed in the environment of the China Financial Futures Exchange (CFFEX), where it has demonstrated its functionality and stability by running for over 600 hours while handling hundreds of millions of packets per day. Compared with the existing CPU-based solution at CFFEX, our system supports identical functionality while reducing the latency from over 100 microseconds to 2 microseconds, a 50x speedup.
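The cache list's bitonic sort is attractive in hardware because its compare-exchange pattern is fixed and data-independent, so it maps directly to a sorting network. A standard iterative bitonic sort for a power-of-two list is sketched below; the list size and contents are illustrative.

```c
#include <stdio.h>

#define N 8  /* cache-list capacity; power of two for the bitonic network */

static void cmp_swap(int *a, int i, int j, int dir) {
    if ((a[i] > a[j]) == dir) {      /* dir=1: ascending order */
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

/* Iterative bitonic sorting network: the fixed compare-exchange
 * schedule is what makes it attractive for an FPGA cache list. */
static void bitonic_sort(int *a) {
    for (int k = 2; k <= N; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            for (int i = 0; i < N; i++) {
                int ixj = i ^ j;
                if (ixj > i)
                    cmp_swap(a, i, ixj, (i & k) == 0);
            }
}

int main(void) {
    int prices[N] = { 42, 7, 19, 3, 88, 55, 21, 64 };  /* toy order prices */
    bitonic_sort(prices);
    for (int i = 0; i < N; i++) printf("%d ", prices[i]);
    printf("\n");
    return 0;
}
```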
Session details: Panel: FPGAs in the Cloud. G. Constantinides. DOI: 10.1145/3257188
120-core microAptiv MIPS Overlay for the Terasic DE5-NET FPGA board. DOI: 10.1145/3020078.3021751
B. ChethanKumarH., P. Ravi, G. Modi, Nachiket Kapre
We design a 120-core, 94MHz MIPS processor FPGA overlay, interconnected with a lightweight message-passing fabric, that fits on a Stratix V GX FPGA (5SGXEA7N2F45C2). We use silicon-tested RTL source code for the microAptiv MIPS processor, made available under the Imagination Technologies Academic Program. We augment the processor with custom instruction extensions for moving data between cores via explicit message passing, and we support these instructions with a communication scratchpad optimized for high-throughput injection of network traffic. We also demonstrate an end-to-end proof-of-concept flow that compiles C code containing suitable MIPS UDI (user-defined instruction) message-passing workloads, and we stress-test the overlay with synthetic workloads.
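On the real overlay, the message-passing UDIs would be emitted as custom opcodes (e.g., via inline assembly); a plain-C model of the intended send/receive semantics might look like the sketch below, where the function names, queue depth, and single shared FIFO are illustrative assumptions rather than the overlay's actual interface.

```c
#include <stdio.h>

/* Software model of the message-passing UDIs: names, queue depth,
 * and semantics here are illustrative assumptions. */
#define DEPTH 16
static int fifo[DEPTH], head, tail;

static void mp_send(int dest_core, int word) {
    (void)dest_core;                  /* routing handled by the fabric */
    fifo[tail++ % DEPTH] = word;      /* inject into the scratchpad    */
}

static int mp_recv(void) {
    return fifo[head++ % DEPTH];      /* drain from the scratchpad     */
}

int main(void) {
    for (int i = 0; i < 4; i++) mp_send(1, i * i);
    for (int i = 0; i < 4; i++) printf("core 1 got %d\n", mp_recv());
    return 0;
}
```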
Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming HPC Applications (Abstract Only). DOI: 10.1145/3020078.3021767
Streaming HPC applications are data-intensive and see widespread use in various fields (e.g., computational fluid dynamics and bioinformatics). These applications consist of different processing kernels, each performing a specific computation on its input data, and the objective of the optimization process is to maximize performance. FPGAs show great promise for accelerating streaming applications because of their low power consumption combined with high theoretical compute capabilities. However, mapping an HPC application to a reconfigurable fabric is a challenging task, exacerbated by the need to temporally partition computational kernels when application requirements exceed resource availability. In this poster, we present work towards a novel design methodology for exploring the design space of streaming HPC applications on FPGAs. We assume the designer can represent the target application as a Synchronous Data Flow Graph (SDFG), in which nodes are compute kernels and edges signify data flow between kernels. The designer also specifies the problem size of the application and the volume of raw data at each memory source of the SDFG. The output of our method is a set of FPGA configurations, each containing one or more SDFG nodes. The methodology consists of three main steps. In Step 1, we enumerate the valid partitions and the base configurations. In Step 2, we find the feasible base configurations given the available hardware resources and a library of processing kernel implementations. Finally, in Step 3, we use a performance model to calculate the execution time of each partition. Our current assumption is that it is advantageous to represent the SDFG at a coarse granularity, since this enables exhaustive exploration of the design space for practical applications. This approach has yielded promising preliminary results; in one case, the temporal configuration selected by our methodology outperformed the direct mapping by 3X.
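Steps 1 and 3 can be illustrated with a chain-shaped SDFG: each way of cutting the chain yields a temporal partition, and a simple model charges a reconfiguration cost per configuration plus the time of the slowest kernel in each configuration (assuming kernels within one configuration pipeline their data). The kernel times and the reconfiguration constant below are hypothetical.

```c
#include <stdio.h>

#define NKERN  3
#define RECONF 50  /* illustrative reconfiguration cost */

int main(void) {
    /* Hypothetical per-kernel streaming times for a 3-node SDFG chain. */
    int t[NKERN] = { 100, 40, 70 };
    /* Each bit of 'cuts' splits the chain after kernel i; kernels that
     * share a configuration pipeline, so the slowest one dominates. */
    for (unsigned cuts = 0; cuts < (1u << (NKERN - 1)); cuts++) {
        int total = RECONF, worst = 0;
        for (int i = 0; i < NKERN; i++) {
            if (t[i] > worst) worst = t[i];
            if (i == NKERN - 1 || (cuts & (1u << i))) {
                total += worst;                     /* close this config */
                if (i < NKERN - 1) total += RECONF; /* load the next one */
                worst = 0;
            }
        }
        printf("cuts=0x%x -> estimated time %d\n", cuts, total);
    }
    return 0;
}
```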