Automatic generation of specialized direct convolutions for mobile GPUs

Naums Mogers, Valentin Radu, Lu Li, Jack Turner, M. O’Boyle, Christophe Dubach

Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit
Published: 19 February 2020
DOI: 10.1145/3366428.3380771
Citations: 12

Abstract
Convolutional Neural Networks (CNNs) are a powerful and versatile tool for performing computer vision tasks in both resource-constrained settings and server-side applications. Most GPU hardware vendors provide highly tuned libraries for CNNs, such as Nvidia's cuDNN or the ARM Compute Library. Such libraries are the basis for higher-level, commonly used machine-learning frameworks such as PyTorch or Caffe, abstracting them away from vendor-specific implementation details. However, writing optimized parallel code for GPUs is far from trivial. This places a significant burden on hardware-specific library writers, who have to continually play catch-up with rapid hardware and network evolution. To reduce effort and time to market, new approaches are needed based on automatic code generation rather than manual implementation. This paper describes such an approach for direct convolutions using Lift, a new data-parallel intermediate language and compiler. Lift uses a high-level intermediate language to express algorithms, which are then automatically optimized using a system of rewrite rules. Direct convolution, as opposed to the matrix-multiplication approach commonly used by machine-learning frameworks, uses an order of magnitude less memory, which is critical for mobile devices. Using Lift, we show that it is possible to automatically generate code that is 10× faster than the direct convolution of the very specialized ARM Compute Library, while using 3.6× less space than its GEMM-based convolution, on the latest generation of ARM Mali GPUs.
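To make the memory argument concrete, the sketch below contrasts the two convolution strategies the abstract refers to: direct convolution reads each input window in place, while the im2col + GEMM path used by many frameworks first materialises every K×K×C_in patch into a large intermediate matrix before a single matrix multiplication. This is a minimal illustrative NumPy sketch, not the paper's Lift-generated OpenCL code, and the layer shape is an assumed, VGG-like example.

```python
# Minimal sketch (assumed shapes, not the paper's generated kernels) comparing
# the memory footprint of direct convolution vs. im2col + GEMM convolution.
import numpy as np

def direct_conv2d(x, w):
    """Direct convolution: reads input windows in place, no intermediate buffer.
    x: (H, W, C_in), w: (K, K, C_in, C_out), stride 1, no padding."""
    H, W, C_in = x.shape
    K, _, _, C_out = w.shape
    out = np.zeros((H - K + 1, W - K + 1, C_out), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + K, j:j + K, :]             # a view, no copy
            out[i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def im2col_conv2d(x, w):
    """GEMM-based convolution: copies every KxKxC_in patch into a row of a
    large matrix, inflating the input's footprint by roughly K*K."""
    H, W, C_in = x.shape
    K, _, _, C_out = w.shape
    rows = [
        x[i:i + K, j:j + K, :].reshape(-1)
        for i in range(H - K + 1)
        for j in range(W - K + 1)
    ]
    cols = np.stack(rows)                              # (H_out*W_out, K*K*C_in)
    out = cols @ w.reshape(-1, C_out)                  # one large matrix multiply
    return out.reshape(H - K + 1, W - K + 1, C_out), cols.nbytes

if __name__ == "__main__":
    x = np.random.rand(56, 56, 64).astype(np.float32)  # illustrative layer size
    w = np.random.rand(3, 3, 64, 128).astype(np.float32)
    ref = direct_conv2d(x, w)
    out, im2col_bytes = im2col_conv2d(x, w)
    assert np.allclose(ref, out, rtol=1e-3, atol=1e-3)
    print(f"input buffer:  {x.nbytes / 2**20:.2f} MiB")
    print(f"im2col buffer: {im2col_bytes / 2**20:.2f} MiB (roughly K*K times larger)")
```

On this assumed 3×3 layer the im2col buffer is close to nine times the size of the input itself; that intermediate buffer is exactly what the direct approach avoids, which is why the memory saving matters on mobile GPUs with limited memory.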