High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu
{"title":"High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures","authors":"Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu","doi":"arxiv-2408.00278","DOIUrl":null,"url":null,"abstract":"Convolution is the core component within deep neural networks and it is\ncomputationally intensive and time consuming. Tensor data layouts significantly\nimpact convolution operations in terms of memory access and computational\nefficiency. Yet, there is still a lack of comprehensive performance\ncharacterization on data layouts on SIMD architectures concerning convolution\nmethods. This paper proposes three novel data layouts for im2win convolution:\nNHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques\nfor both direct and im2win convolutions. We compare the optimized im2win\nconvolution with the direct convolution and PyTorch's im2col-based convolution\nacross the aforementioned layouts on SIMD machines. The experiments\ndemonstrated that the im2win convolution with the new NHWC layout achieved up\nto 355% performance speedup over NCHW layout. Our optimizations also\nsignificantly improve the performance of both im2win and direct convolutions.\nOur optimized im2win and direct convolutions achieved up to 95% and 94% of\nmachine's theoretical peak performance, respectively.","PeriodicalId":501347,"journal":{"name":"arXiv - CS - Neural and Evolutionary Computing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Neural and Evolutionary Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00278","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly impact convolution operations in terms of memory access and computational efficiency. Yet there is still a lack of comprehensive performance characterization of data layouts on SIMD architectures with respect to convolution methods. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieves up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions, which achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.
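To make the layout distinction concrete, the sketch below (illustrative C++, not code from the paper; the helper names `offset_nchw`/`offset_nhwc` and the tensor dimensions are hypothetical) shows the flattened-offset formulas for the NCHW and NHWC layouts the abstract compares. In NHWC the channel index varies fastest, so the channels of a single pixel are contiguous in memory and can map directly onto SIMD vector lanes; in NCHW the channel stride is the full spatial plane, H*W.

```cpp
// Illustrative sketch (not from the paper): flattened-offset formulas for
// the NCHW and NHWC tensor layouts discussed in the abstract.
#include <cstddef>
#include <cstdio>

// Offset of element (n, c, h, w) in a tensor stored in NCHW order:
// w varies fastest, then h, then c, then n.
inline std::size_t offset_nchw(std::size_t n, std::size_t c,
                               std::size_t h, std::size_t w,
                               std::size_t C, std::size_t H, std::size_t W) {
    return ((n * C + c) * H + h) * W + w;
}

// Offset of the same element in a tensor stored in NHWC order:
// c varies fastest, so a pixel's channels are contiguous.
inline std::size_t offset_nhwc(std::size_t n, std::size_t c,
                               std::size_t h, std::size_t w,
                               std::size_t C, std::size_t H, std::size_t W) {
    return ((n * H + h) * W + w) * C + c;
}

int main() {
    const std::size_t C = 64, H = 56, W = 56;  // hypothetical layer shape
    // Stepping c by 1 moves H*W elements in NCHW but only 1 element in
    // NHWC, which is why channel-last layouts vectorize cleanly over c.
    std::printf("NCHW stride over c: %zu\n",
                offset_nchw(0, 1, 0, 0, C, H, W) -
                offset_nchw(0, 0, 0, 0, C, H, W));
    std::printf("NHWC stride over c: %zu\n",
                offset_nhwc(0, 1, 0, 0, C, H, W) -
                offset_nhwc(0, 0, 0, 0, C, H, W));
    return 0;
}
```

For the shape above this prints a channel stride of 3136 for NCHW versus 1 for NHWC, which is the memory-access difference underlying the layout comparison in the paper.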