Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming HPC Applications (Abstract Only)

Mostafa Koraei, Magnus Jahre, S. O. Fatemi
{"title":"面向流式高性能计算应用的fpga加速器的高效设计空间探索(仅摘要)","authors":"Mostafa Koraei, Magnus Jahre, S. O. Fatemi","doi":"10.1145/3020078.3021767","DOIUrl":null,"url":null,"abstract":"Streaming HPC applications are data intensive and have widespread use in various fields (e.g., Computational Fluid Dynamics and Bioinformatics). These applications consist of different processing kernels where each kernel performs a specific computation on its input data. The objective of the optimization process is to maximize performance. FPGAs show great promise for accelerating streaming applications because of their low power consumption combined with high theoretical compute capabilities. However, mapping an HPC application to a reconfigurable fabric is a challenging task. The challenge is exacerbated by need to temporally partition computational kernels when application requirements exceed resource availability. In this poster, we present work towards a novel design methodology for exploring design space of streaming HPC applications on FPGAs. We assume that the designer can represent the target application with a Synchronous Data Flow Graph (SDFG). In the SDFG, the nodes are compute kernels and the edges signify data flow between kernels. The designer should also determine the problem size of the application and the volume of raw data on each memory source of the SDFG. The output of our method is a set of FPGA configurations that each contain one or more SDFG nodes. The methodology consists of three main steps. In Step 1, we enumerate the valid partitions and the base configurations. In Step 2, we find the feasible base configurations given the hardware resources available and a library of processing kernel implementations. Finally, we use a performance model to calculate the execution time of each partition in Step 3. Our current assumption is that it is advantageous to represent SDFG at a coarse granularity since this enables exhaustive exploration of the design space for practical applications. This approach has yielded promising preliminary results. In one case, the temporal configuration selected by our methodology outperformed the direct mapping by 3X.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards Efficient Design Space Exploration of FPGA-based Accelerators for Streaming HPC Applications (Abstract Only)\",\"authors\":\"Mostafa Koraei, Magnus Jahre, S. O. Fatemi\",\"doi\":\"10.1145/3020078.3021767\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Streaming HPC applications are data intensive and have widespread use in various fields (e.g., Computational Fluid Dynamics and Bioinformatics). These applications consist of different processing kernels where each kernel performs a specific computation on its input data. The objective of the optimization process is to maximize performance. FPGAs show great promise for accelerating streaming applications because of their low power consumption combined with high theoretical compute capabilities. However, mapping an HPC application to a reconfigurable fabric is a challenging task. The challenge is exacerbated by need to temporally partition computational kernels when application requirements exceed resource availability. 
In this poster, we present work towards a novel design methodology for exploring design space of streaming HPC applications on FPGAs. We assume that the designer can represent the target application with a Synchronous Data Flow Graph (SDFG). In the SDFG, the nodes are compute kernels and the edges signify data flow between kernels. The designer should also determine the problem size of the application and the volume of raw data on each memory source of the SDFG. The output of our method is a set of FPGA configurations that each contain one or more SDFG nodes. The methodology consists of three main steps. In Step 1, we enumerate the valid partitions and the base configurations. In Step 2, we find the feasible base configurations given the hardware resources available and a library of processing kernel implementations. Finally, we use a performance model to calculate the execution time of each partition in Step 3. Our current assumption is that it is advantageous to represent SDFG at a coarse granularity since this enables exhaustive exploration of the design space for practical applications. This approach has yielded promising preliminary results. In one case, the temporal configuration selected by our methodology outperformed the direct mapping by 3X.\",\"PeriodicalId\":252039,\"journal\":{\"name\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3020078.3021767\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Streaming HPC applications are data intensive and have widespread use in various fields (e.g., Computational Fluid Dynamics and Bioinformatics). These applications consist of different processing kernels, where each kernel performs a specific computation on its input data. The objective of the optimization process is to maximize performance. FPGAs show great promise for accelerating streaming applications because of their low power consumption combined with high theoretical compute capabilities. However, mapping an HPC application to a reconfigurable fabric is a challenging task. The challenge is exacerbated by the need to temporally partition computational kernels when application requirements exceed resource availability.

In this poster, we present work towards a novel design methodology for exploring the design space of streaming HPC applications on FPGAs. We assume that the designer can represent the target application with a Synchronous Data Flow Graph (SDFG). In the SDFG, the nodes are compute kernels and the edges signify data flow between kernels. The designer should also determine the problem size of the application and the volume of raw data at each memory source of the SDFG. The output of our method is a set of FPGA configurations that each contain one or more SDFG nodes. The methodology consists of three main steps. In Step 1, we enumerate the valid partitions and the base configurations. In Step 2, we find the feasible base configurations given the hardware resources available and a library of processing kernel implementations. Finally, in Step 3, we use a performance model to calculate the execution time of each partition. Our current assumption is that it is advantageous to represent the SDFG at a coarse granularity, since this enables exhaustive exploration of the design space for practical applications. This approach has yielded promising preliminary results. In one case, the temporal configuration selected by our methodology outperformed the direct mapping by 3X.
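The three steps lend themselves to a compact illustration. The sketch below is not the authors' tool: it assumes a linear pipeline of kernels, invented LUT/BRAM and throughput figures standing in for the kernel-implementation library, and a fixed reconfiguration cost. It enumerates order-preserving temporal partitions (Step 1), filters them against a resource budget (Step 2), and ranks the survivors with a simple streaming performance model (Step 3).

```python
# Illustrative sketch only -- not the authors' tool. All numbers are made up.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Kernel:
    name: str
    luts: int          # resource estimate taken from the kernel library
    brams: int
    throughput: float  # bytes processed per second when mapped on its own


@dataclass
class SDFG:
    """Coarse-grained SDFG: nodes are compute kernels, edges are data flows.

    The toy search below treats the graph as a linear pipeline; the edge list
    is included only to mirror the representation described in the abstract.
    """
    kernels: List[Kernel]
    edges: List[Tuple[str, str]]   # (producer, consumer)
    data_bytes: Dict[str, float]   # raw data volume at each memory source


def ordered_partitions(kernels: List[Kernel]):
    """Step 1: enumerate order-preserving splits of the pipeline.

    Each group of kernels becomes one FPGA configuration; the groups are
    loaded one after another (temporal partitioning).
    """
    n = len(kernels)
    for cut_mask in range(2 ** (n - 1)):
        groups, current = [], [kernels[0]]
        for i in range(1, n):
            if cut_mask & (1 << (i - 1)):
                groups.append(current)
                current = [kernels[i]]
            else:
                current.append(kernels[i])
        groups.append(current)
        yield groups


def feasible(group: List[Kernel], lut_budget: int, bram_budget: int) -> bool:
    """Step 2: a configuration is feasible if its kernels fit on the device."""
    return (sum(k.luts for k in group) <= lut_budget
            and sum(k.brams for k in group) <= bram_budget)


def execution_time(groups: List[List[Kernel]], data_bytes: float,
                   reconfig_s: float) -> float:
    """Step 3: crude performance model.

    Each configuration streams the whole data set at the rate of its slowest
    kernel; switching configurations costs reconfig_s seconds.
    """
    time = (len(groups) - 1) * reconfig_s
    for group in groups:
        time += data_bytes / min(k.throughput for k in group)
    return time
```

Under these assumptions, picking the temporal configuration reduces to an exhaustive search over the (small) set of feasible partitions:

```python
# Invented figures for a three-kernel pipeline and a hypothetical device.
kernels = [Kernel("read", 20_000, 40, 4e9),
           Kernel("transform", 90_000, 300, 2e9),
           Kernel("reduce", 60_000, 120, 3e9)]
candidates = [g for g in ordered_partitions(kernels)
              if all(feasible(grp, lut_budget=100_000, bram_budget=400)
                     for grp in g)]
best = min(candidates,
           key=lambda g: execution_time(g, data_bytes=8e9, reconfig_s=0.05))
```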