A framework for low communication approaches for large scale 3D convolution

Workshop Proceedings of the 51st International Conference on Parallel Processing. Pub Date: 2022-08-29. DOI: 10.1145/3547276.3548626

Anuva Kulkarni, Jelena Kovacevic, F. Franchetti


Citations: 0

Abstract

Large-scale 3D convolutions computed using parallel Fast Fourier Transforms (FFTs) require multiple all-to-all communication steps, which cause bottlenecks on computing clusters. Because data transfer speeds to and from memory have not increased in proportion to computational capacity (in terms of FLOPs), 3D FFTs become communication-bound and are difficult to scale, especially on modern heterogeneous computing platforms that include accelerators such as GPUs. Existing HPC frameworks focus on optimizing the isolated FFT algorithm or its communication patterns, but still require multiple all-to-all communication steps during convolution. In this work, we present a strategy for scalable convolution that avoids multiple all-to-all exchanges and also optimizes the communication that remains necessary. We provide proof-of-concept results under the assumptions of one use case: the convolution kernel of the MASSIF Hooke’s law simulation. Our method localizes computation by exploiting properties of the data and approximates the convolution result through data compression, which increases the scalability of 3D convolution. Our preliminary results show 8 times the scalability of traditional methods on the same compute resources, without adversely affecting result accuracy. Our method can be adapted for first-principles scientific simulations, and it leverages cross-disciplinary knowledge of the application, the data, and computing to perform large-scale convolution while avoiding communication bottlenecks. To make our approach widely usable and adaptable to emerging challenges, we discuss the use of FFTX, a novel framework that can be used for platform-agnostic specification and optimization of algorithmic approaches similar to ours.
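To make the baseline concrete, the following is a minimal single-process sketch of the conventional FFT-based 3D convolution that the abstract describes as communication-bound. It is not the authors' method or code. It assumes a small cubic grid with periodic boundaries, and it uses a naive O(n^2) DFT as a stand-in for a real FFT so that the block needs only the standard library. All function names (`dft_1d`, `dft_3d`, `fft_convolve_3d`, and so on) are hypothetical. The three per-axis passes in `dft_3d` are the point of interest: when the grid is distributed across nodes, switching from one axis to the next typically means redistributing the data so that the new axis is local, which is where all-to-all exchanges arise.

```python
import cmath
import itertools


def make_grid(n, fn):
    """Build an n x n x n nested list whose entries are fn(x, y, z)."""
    return [[[fn(x, y, z) for z in range(n)] for y in range(n)] for x in range(n)]


def dft_1d(values, inverse=False):
    """Naive O(n^2) 1D discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for j, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
        out.append(acc / n if inverse else acc)
    return out


def dft_3d(grid, inverse=False):
    """3D transform as three passes of 1D transforms, one pass per axis.

    In a distributed setting each change of axis would typically need a
    data redistribution (assumption: slab- or pencil-style decomposition).
    """
    n = len(grid)
    data = make_grid(n, lambda x, y, z: complex(grid[x][y][z]))
    for x, y in itertools.product(range(n), repeat=2):  # pass along z
        data[x][y] = dft_1d(data[x][y], inverse)
    for x, z in itertools.product(range(n), repeat=2):  # pass along y
        line = dft_1d([data[x][y][z] for y in range(n)], inverse)
        for y in range(n):
            data[x][y][z] = line[y]
    for y, z in itertools.product(range(n), repeat=2):  # pass along x
        line = dft_1d([data[x][y][z] for x in range(n)], inverse)
        for x in range(n):
            data[x][y][z] = line[x]
    return data


def fft_convolve_3d(a, b):
    """Circular convolution: forward transforms, pointwise product, inverse."""
    n = len(a)
    fa, fb = dft_3d(a), dft_3d(b)
    prod = make_grid(n, lambda x, y, z: fa[x][y][z] * fb[x][y][z])
    res = dft_3d(prod, inverse=True)
    return make_grid(n, lambda x, y, z: res[x][y][z].real)


def direct_convolve_3d(a, b):
    """Reference circular convolution computed straight from the definition."""
    n = len(a)
    idx = list(itertools.product(range(n), repeat=3))
    return make_grid(n, lambda x, y, z: sum(
        a[i][j][k] * b[(x - i) % n][(y - j) % n][(z - k) % n] for i, j, k in idx))


def max_abs_diff(p, q):
    """Largest absolute elementwise difference between two grids."""
    n = len(p)
    return max(abs(p[x][y][z] - q[x][y][z])
               for x, y, z in itertools.product(range(n), repeat=3))


# Illustrative data only; these grids are not taken from the paper.
n = 4
a = make_grid(n, lambda x, y, z: float(x + 2 * y + 3 * z))
b = make_grid(n, lambda x, y, z: 1.0 / (1 + x + y + z))
err = max_abs_diff(fft_convolve_3d(a, b), direct_convolve_3d(a, b))
print("max abs difference between FFT-based and direct convolution:", err)
```

The sketch only checks that the transform-multiply-inverse route agrees with the direct definition on a toy grid. It says nothing about the localization or compression steps of the proposed method, which the abstract does not specify.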
