Automatic generation of specialized direct convolutions for mobile GPUs

Naums Mogers, Valentin Radu, Lu Li, Jack Turner, M. O’Boyle, Christophe Dubach
{"title":"Automatic generation of specialized direct convolutions for mobile GPUs","authors":"Naums Mogers, Valentin Radu, Lu Li, Jack Turner, M. O’Boyle, Christophe Dubach","doi":"10.1145/3366428.3380771","DOIUrl":null,"url":null,"abstract":"Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs such as Nvidia's cuDNN or ARM Compute Library. Such libraries are the basis for higher-level, commonly-used, machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers which have to continually play catch-up with rapid hardware and network evolution. To reduce effort and reduce time to market, new approaches are needed based on automatic code generation, rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms which are then automatically optimized using a system of rewrite-rules. Direct convolution, as opposed to the matrix multiplication approach used commonly by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to generate automatically code that is X10 faster than the direct convolution while using X3.6 less space than the GEMM-based convolution of the very specialized ARM Compute Library on the latest generation of ARM Mali GPU.","PeriodicalId":266831,"journal":{"name":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3366428.3380771","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource-constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned CNN libraries, such as Nvidia's cuDNN or the ARM Compute Library. Such libraries are the basis for higher-level, commonly used machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers, who have to continually play catch-up with rapid hardware and network evolution. To reduce effort and time to market, new approaches are needed based on automatic code generation rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms, which are then automatically optimized using a system of rewrite rules. Direct convolution, as opposed to the matrix-multiplication approach commonly used by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to automatically generate code that is 10× faster than the direct convolution of the highly specialized ARM Compute Library, while using 3.6× less space than its GEMM-based convolution, on the latest generation of ARM Mali GPU.
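
To make the memory argument concrete, the sketch below contrasts direct convolution with the im2col + GEMM lowering it replaces. This is a minimal, hypothetical illustration in plain C, not the Lift-generated OpenCL for Mali; the NCHW layout, function names, and parameters are assumptions made for exposition.

```c
#include <stddef.h>

/* Naive direct 2D convolution, stride 1, no padding.
 * in : c_in  x h x w_in              input feature map
 * wt : c_out x c_in x k x k          filters
 * out: c_out x (h-k+1) x (w_in-k+1)  output feature map */
void direct_conv2d(const float *in, const float *wt, float *out,
                   int c_in, int h, int w_in, int c_out, int k)
{
    int oh = h - k + 1, ow = w_in - k + 1;
    for (int co = 0; co < c_out; co++)
        for (int y = 0; y < oh; y++)
            for (int x = 0; x < ow; x++) {
                float acc = 0.0f;
                /* Accumulate directly from the input: no intermediate
                 * buffer is needed beyond the output itself. */
                for (int ci = 0; ci < c_in; ci++)
                    for (int ky = 0; ky < k; ky++)
                        for (int kx = 0; kx < k; kx++)
                            acc += in[(ci * h + (y + ky)) * w_in + (x + kx)]
                                 * wt[((co * c_in + ci) * k + ky) * k + kx];
                out[(co * oh + y) * ow + x] = acc;
            }
}

/* The GEMM route first materializes an im2col matrix of shape
 * (c_in*k*k) x (oh*ow), i.e. roughly k*k overlapping copies of the
 * input, before one matrix multiply. For a 3x3 kernel that is about
 * a 9x blow-up in scratch memory, which direct convolution avoids. */
size_t im2col_scratch_bytes(int c_in, int h, int w_in, int k)
{
    size_t oh = (size_t)(h - k + 1), ow = (size_t)(w_in - k + 1);
    return (size_t)c_in * k * k * oh * ow * sizeof(float);
}
```

As a rough worked example: for a 224×224 RGB input and a 3×3 kernel, the im2col buffer alone is 3·9·222·222·4 B ≈ 5.3 MB, versus the ~0.6 MB input that direct convolution reads in place, which is the order-of-magnitude gap the abstract refers to.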