High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu
{"title":"High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures","authors":"Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu","doi":"arxiv-2408.00278","DOIUrl":null,"url":null,"abstract":"Convolution is the core component within deep neural networks and it is\ncomputationally intensive and time consuming. Tensor data layouts significantly\nimpact convolution operations in terms of memory access and computational\nefficiency. Yet, there is still a lack of comprehensive performance\ncharacterization on data layouts on SIMD architectures concerning convolution\nmethods. This paper proposes three novel data layouts for im2win convolution:\nNHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques\nfor both direct and im2win convolutions. We compare the optimized im2win\nconvolution with the direct convolution and PyTorch's im2col-based convolution\nacross the aforementioned layouts on SIMD machines. The experiments\ndemonstrated that the im2win convolution with the new NHWC layout achieved up\nto 355% performance speedup over NCHW layout. Our optimizations also\nsignificantly improve the performance of both im2win and direct convolutions.\nOur optimized im2win and direct convolutions achieved up to 95% and 94% of\nmachine's theoretical peak performance, respectively.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still a lack of comprehensive performance characterization of data layouts on SIMD architectures with respect to convolution methods. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieves up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions, which achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.
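To make the layout distinction concrete, the sketch below (illustrative C++, not code from the paper; the helper names `offset_nchw`/`offset_nhwc` and the tensor dimensions are hypothetical) shows the flattened-offset formulas for the NCHW and NHWC layouts the abstract compares. In NHWC the channel index varies fastest, so the channels of a single pixel are contiguous in memory and can map directly onto SIMD vector lanes; in NCHW the channel stride is the full spatial plane, H*W.

```cpp
// Illustrative sketch (not from the paper): flattened-offset formulas for
// the NCHW and NHWC tensor layouts discussed in the abstract.
#include <cstddef>
#include <cstdio>

// Offset of element (n, c, h, w) in a tensor stored in NCHW order:
// w varies fastest, then h, then c, then n.
inline std::size_t offset_nchw(std::size_t n, std::size_t c,
                               std::size_t h, std::size_t w,
                               std::size_t C, std::size_t H, std::size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Offset of the same element in a tensor stored in NHWC order:
// c varies fastest, so a pixel's channels are contiguous.
inline std::size_t offset_nhwc(std::size_t n, std::size_t c,
                               std::size_t h, std::size_t w,
                               std::size_t C, std::size_t H, std::size_t W) {
    return ((n * H + h) * W + w) * C + c;
}

int main() {
    const std::size_t C = 64, H = 56, W = 56;  // hypothetical layer shape
    // Stepping c by 1 moves H*W elements in NCHW but only 1 element in
    // NHWC, which is why channel-last layouts vectorize cleanly over c.
    std::printf("NCHW stride over c: %zu\n",
                offset_nchw(0, 1, 0, 0, C, H, W) -
                offset_nchw(0, 0, 0, 0, C, H, W));
    std::printf("NHWC stride over c: %zu\n",
                offset_nhwc(0, 1, 0, 0, C, H, W) -
                offset_nhwc(0, 0, 0, 0, C, H, W));
    return 0;
}
```

For the shape above this prints a channel stride of 3136 for NCHW versus 1 for NHWC, which is the memory-access difference underlying the layout comparison in the paper.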