CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices

Hugh Perkins
{"title":"CUDA-on- cl:用于在OpenCL™1.2设备上运行NVIDIA®CUDA™c++ 11应用程序的编译器和运行时","authors":"Hugh Perkins","doi":"10.1145/3078155.3078156","DOIUrl":null,"url":null,"abstract":"In the machine learning domain, machine learning frameworks are predominantly written and maintained in NVIDIA® CUDA™ language. There have been attempts to port these frameworks to OpenCL®, notably the ports of Caffe framework by Gu et al; Tschopp; and Engel; and of Torch framework by Perkins. The authors of these frameworks found merging their work into the mainstream framework challenging, and maintain their forks as separate branches or repositories. CUDA-on-CL addresses this problem by leaving the reference implementation entirely in NVIDIA CUDA, both host-side and device-side, and providing a compiler and a runtime component, so that any CUDA C++11 application can in theory be compiled and run on any OpenCL 1.2 device. We use Tensorflow framework as a case-study, and demonstrate the ability to run unary, binary and reduction Tensorflow and Eigen kernels, with no modification to the original CUDA source-code. Performance studies are undertaken, using the Tensorflow kernels. For buffer sizes of 1MB or more, performance is comparable between CUDA and CUDA-on-CL, across unary operations, binary operations and single-axis reductions. Full reduction is around 14 times slower on CUDA-on-CL than on CUDA. We think this may be because of the absence of the low-level hardware shfl operation. The asymptotic time for zero buffer sizes is double that of CUDA, possibly because of the overhead of additional kernel boilerplate needed to workaround limitations in the OpenCL 1.2 standard.","PeriodicalId":267581,"journal":{"name":"Proceedings of the 5th International Workshop on OpenCL","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"CUDA-on-CL: a compiler and runtime for running NVIDIA® CUDA™ C++11 applications on OpenCL™ 1.2 Devices\",\"authors\":\"Hugh Perkins\",\"doi\":\"10.1145/3078155.3078156\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the machine learning domain, machine learning frameworks are predominantly written and maintained in NVIDIA® CUDA™ language. There have been attempts to port these frameworks to OpenCL®, notably the ports of Caffe framework by Gu et al; Tschopp; and Engel; and of Torch framework by Perkins. The authors of these frameworks found merging their work into the mainstream framework challenging, and maintain their forks as separate branches or repositories. CUDA-on-CL addresses this problem by leaving the reference implementation entirely in NVIDIA CUDA, both host-side and device-side, and providing a compiler and a runtime component, so that any CUDA C++11 application can in theory be compiled and run on any OpenCL 1.2 device. We use Tensorflow framework as a case-study, and demonstrate the ability to run unary, binary and reduction Tensorflow and Eigen kernels, with no modification to the original CUDA source-code. Performance studies are undertaken, using the Tensorflow kernels. For buffer sizes of 1MB or more, performance is comparable between CUDA and CUDA-on-CL, across unary operations, binary operations and single-axis reductions. Full reduction is around 14 times slower on CUDA-on-CL than on CUDA. We think this may be because of the absence of the low-level hardware shfl operation. 
The asymptotic time for zero buffer sizes is double that of CUDA, possibly because of the overhead of additional kernel boilerplate needed to workaround limitations in the OpenCL 1.2 standard.\",\"PeriodicalId\":267581,\"journal\":{\"name\":\"Proceedings of the 5th International Workshop on OpenCL\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 5th International Workshop on OpenCL\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3078155.3078156\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th International Workshop on OpenCL","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078155.3078156","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

In the machine learning domain, machine learning frameworks are predominantly written and maintained in the NVIDIA® CUDA™ language. There have been attempts to port these frameworks to OpenCL®, notably the ports of the Caffe framework by Gu et al., Tschopp, and Engel, and of the Torch framework by Perkins. The authors of these ports found merging their work into the mainstream frameworks challenging, and maintain their forks as separate branches or repositories. CUDA-on-CL addresses this problem by leaving the reference implementation entirely in NVIDIA CUDA, both host-side and device-side, and providing a compiler and a runtime component, so that any CUDA C++11 application can in theory be compiled and run on any OpenCL 1.2 device. We use the Tensorflow framework as a case study, and demonstrate the ability to run unary, binary, and reduction Tensorflow and Eigen kernels with no modification to the original CUDA source code. Performance studies are undertaken using the Tensorflow kernels. For buffer sizes of 1 MB or more, performance is comparable between CUDA and CUDA-on-CL across unary operations, binary operations, and single-axis reductions. Full reduction is around 14 times slower on CUDA-on-CL than on CUDA; we think this may be because of the absence of the low-level hardware shfl operation. The asymptotic time for zero buffer sizes is double that of CUDA, possibly because of the overhead of additional kernel boilerplate needed to work around limitations in the OpenCL 1.2 standard.
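
To make the source-compatibility claim concrete, below is a minimal sketch (our illustration, not code from the paper) of the kind of unmodified single-source CUDA C++ file that CUDA-on-CL aims to compile and run on an OpenCL 1.2 device. The unary elementwise kernel mirrors the shape of the Tensorflow/Eigen unary kernels the study exercises; the kernel name and host code are hypothetical:

```cuda
// Hypothetical illustration, not code from the paper: an unmodified
// single-source CUDA C++ program of the kind CUDA-on-CL targets.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Unary elementwise kernel, the same shape as the Tensorflow/Eigen
// unary kernels the paper benchmarks: one thread per element.
__global__ void unary_sqrt(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sqrtf(in[i]);
}

int main() {
    const int n = 1 << 20;               // ~4 MB buffer, in the >=1 MB regime measured
    const size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    unary_sqrt<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // implicit sync

    printf("sqrt(9) = %f\n", h_out[9]);  // expect 3.0
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```

Under the paper's design, both the kernel and the host-side CUDA API calls above stay as-is: the compiler handles the device code, and the runtime component presumably maps the host API onto OpenCL.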
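The shfl remark deserves unpacking, since it carries the paper's main performance caveat. CUDA reductions typically finish with warp shuffle intrinsics that exchange partial sums directly between registers; OpenCL 1.2 has no comparable primitive (subgroup shuffles only arrived in later extensions), so translated code must route the same exchange through local memory and barriers. A sketch of the two idioms, both written as CUDA for side-by-side comparison (ours, under stated assumptions, not the paper's code):

```cuda
// Assumption-labeled sketch, not the paper's code: the warp-shuffle
// reduction idiom that OpenCL 1.2 cannot express, next to the
// local-memory pattern a 1.2 port has to fall back to.

// Fast path on CUDA hardware: partial sums move register-to-register
// within a warp via shfl; no shared memory, no barriers.
// (__shfl_down_sync is the CUDA 9+ spelling; the paper's era used __shfl_down.)
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

// Fallback of the kind OpenCL 1.2 forces: every exchange goes through
// shared memory (OpenCL __local) with a barrier per step. Assumes
// blockDim.x is a power of two and at most 256.
__device__ float block_reduce_sum_shared(float val) {
    __shared__ float buf[256];
    int tid = threadIdx.x;
    buf[tid] = val;
    __syncthreads();  // OpenCL equivalent: barrier(CLK_LOCAL_MEM_FENCE)
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    return buf[0];    // every thread reads back the block's sum
}
```

Each shuffle in the first version is a single register move; each step of the second costs a local-memory round trip plus a full work-group barrier, which is at least plausibly consistent with the roughly 14x full-reduction slowdown the paper reports.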