FPGA Acceleration of Irregular Iterative Computations using Criticality-Aware Dataflow Optimizations (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2015-02-22. DOI: 10.1145/2684746.2689110

Siddhartha, Nachiket Kapre



Abstract

FPGA acceleration of large irregular dataflow graphs is often limited by the long-tail distribution of parallelism on fine-grained overlay dataflow architectures. In this paper, we show how to overcome these limitations by exploiting criticality information along compute paths, both statically during graph pre-processing and dynamically at runtime. We statically reassociate high-fanin dataflow chains to provide faster routes for late-arriving inputs. We also perform fanout decomposition and selective node replication to distribute serialization costs across multiple processing elements (PEs). Additionally, we modify the dataflow firing rule in hardware to prefer critical nodes when multiple nodes are ready for evaluation. Together, these transformations shorten the tail of the parallelism profile for these large-scale graphs. Across a range of dataflow benchmarks extracted from sparse LU factorization, we demonstrate up to 2.5× (mean 1.21×) improvement with static pre-processing alone, up to 2.4× (mean 1.17×) improvement with dynamic optimizations alone, and an overall improvement of up to 2.9× (mean 1.39×) when both static and dynamic optimizations are enabled. These improvements are on top of the 3–10× speedups over CPU implementations achieved without our transformations.
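The static reassociation step can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single associative operator with a fixed unit latency, models only input arrival times (no PE placement or network delay), and uses a greedy pairing that always combines the two earliest-available operands. The function names `chain_finish` and `reassociated_finish` are hypothetical.

```python
import heapq


def chain_finish(arrivals, latency=1):
    """Finish time of a left-to-right chain ((a0 op a1) op a2) op ...

    `arrivals` is a non-empty list of input arrival times (assumed model).
    Each operation starts once both operands are available.
    """
    t = arrivals[0]
    for a in arrivals[1:]:
        t = max(t, a) + latency
    return t


def reassociated_finish(arrivals, latency=1):
    """Finish time after greedy reassociation by arrival time.

    Repeatedly combines the two earliest-available operands, so a
    late-arriving input ends up close to the root of the reduction tree
    and passes through few operations.
    """
    heap = list(arrivals)
    heapq.heapify(heap)
    while len(heap) > 1:
        x = heapq.heappop(heap)
        y = heapq.heappop(heap)
        heapq.heappush(heap, max(x, y) + latency)
    return heap[0]


# One input arrives at time 5 and sits at the head of the chain;
# the other four are ready at time 0.
arrivals = [5, 0, 0, 0, 0]
print(chain_finish(arrivals))         # 9: the late input passes through 4 ops
print(reassociated_finish(arrivals))  # 6: the late input passes through 1 op
```

In this toy model the late input dominates the critical path, so moving it next to the root shortens the completion time from 9 to 6 time units. The early inputs are reduced in a balanced subtree while the late one is still in flight.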


