通过宏数据流图转换循环链

Proceedings of the 2018 International Symposium on Code Generation and Optimization Pub Date : 2018-02-24 DOI:10.1145/3168832

Eddie C. Davis, M. Strout, C. Olschanowsky

{"title":"通过宏数据流图转换循环链","authors":"Eddie C. Davis, M. Strout, C. Olschanowsky","doi":"10.1145/3168832","DOIUrl":null,"url":null,"abstract":"This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.","PeriodicalId":103558,"journal":{"name":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","volume":"316 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":"{\"title\":\"Transforming loop chains via macro dataflow graphs\",\"authors\":\"Eddie C. Davis, M. Strout, C. Olschanowsky\",\"doi\":\"10.1145/3168832\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.\",\"PeriodicalId\":103558,\"journal\":{\"name\":\"Proceedings of the 2018 International Symposium on Code Generation and Optimization\",\"volume\":\"316 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"14\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 International Symposium on Code Generation and Optimization\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3168832\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 International Symposium on Code Generation and Optimization","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3168832","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

摘要

本文描述了一种使用修改的宏数据流图进行性能优化的方法，其中包含表示循环和涉及模板计算的数据的节点。目标应用包括现有的包含一系列共享数据的模板计算的科学应用，即循环链。通过修改执行计划，可以提高模板应用程序的性能。然而，现代体系结构越来越受到内存子系统带宽的限制。为了充分实现调度更改对改进局域性的好处，还必须最小化临时存储分配。我们提出了一个宏数据流图变体，它包括数据集节点，一个量化给定图所需的内存交互的成本模型，一组可以在图上执行的转换，如融合和平铺，以及一种生成代码来实现转换后的图的方法。我们包括与Halide和PolyMage实现基准的性能比较。我们最快的变体比两个框架生成的自动调优变体性能更好。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Transforming loop chains via macro dataflow graphs

This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in the stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e. loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by the memory subsystem bandwidth. To fully realize the benefits of the schedule changes for improved locality, temporary storage allocation must also be minimized. We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 International Symposium on Code Generation and Optimization

自引率

0.00%

发文量

期刊最新文献

High performance stencil code generation with Lift DeLICM: scalar dependence removal at zero memory cost Local memory-aware kernel perforation Analyzing and optimizing task granularity on the JVM Lightweight detection of cache conflicts