{"title":"利用Nvidia gpu的半精度算法","authors":"Nhut-Minh Ho, W. Wong","doi":"10.1109/HPEC.2017.8091072","DOIUrl":null,"url":null,"abstract":"With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU that offered FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing performance results, even if they are willing to tolerate the increase in error precision reduction brings. In this paper, we developed an automated conversion framework to help users migrate their CUDA code to better exploit Pascal's half precision capability. Using our tools and techniques, we successfully convert many benchmarks from single precision arithmetic to half precision equivalent, and achieved significant speedup improvement in many cases. In the best case, a 3× speedup over the FP32 version was achieved. We shall also discuss some new issues and opportunities that the Pascal GPUs brought.","PeriodicalId":364903,"journal":{"name":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":"{\"title\":\"Exploiting half precision arithmetic in Nvidia GPUs\",\"authors\":\"Nhut-Minh Ho, W. Wong\",\"doi\":\"10.1109/HPEC.2017.8091072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU that offered FP16 support. However, when actual products were shipped, programmers soon realized that a naïve replacement of single precision (FP32) code with half precision led to disappointing performance results, even if they are willing to tolerate the increase in error precision reduction brings. In this paper, we developed an automated conversion framework to help users migrate their CUDA code to better exploit Pascal's half precision capability. Using our tools and techniques, we successfully convert many benchmarks from single precision arithmetic to half precision equivalent, and achieved significant speedup improvement in many cases. In the best case, a 3× speedup over the FP32 version was achieved. 
We shall also discuss some new issues and opportunities that the Pascal GPUs brought.\",\"PeriodicalId\":364903,\"journal\":{\"name\":\"2017 IEEE High Performance Extreme Computing Conference (HPEC)\",\"volume\":\"80 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"47\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE High Performance Extreme Computing Conference (HPEC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPEC.2017.8091072\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE High Performance Extreme Computing Conference (HPEC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPEC.2017.8091072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Exploiting half precision arithmetic in Nvidia GPUs
Abstract—With the growing importance of deep learning and energy-saving approximate computing, half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent Pascal architecture was the first GPU architecture to offer FP16 support. However, when actual products shipped, programmers soon realized that naïvely replacing single precision (FP32) code with half precision led to disappointing performance, even when they were willing to tolerate the increase in error that precision reduction brings. In this paper, we developed an automated conversion framework to help users migrate their CUDA code to better exploit Pascal's half precision capability. Using our tools and techniques, we successfully converted many benchmarks from single precision arithmetic to their half precision equivalents, achieving significant speedups in many cases; in the best case, a 3× speedup over the FP32 version. We also discuss some new issues and opportunities that the Pascal GPUs bring.
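A likely reason a naïve FP32-to-FP16 substitution disappoints on Pascal, and the kind of rewriting a conversion framework like the authors' must perform, is that the hardware reaches peak FP16 throughput only through the paired half2 type, which packs two FP16 values into 32 bits and processes both per instruction. The sketch below is my own illustration, not code from the paper: a hypothetical AXPY kernel in scalar __half form next to a __half2-vectorized form, using the standard cuda_fp16.h intrinsics.

#include <cuda_fp16.h>

// Naive port: scalar __half arithmetic. Each __hfma performs a single
// FP16 fused multiply-add, so on Pascal (e.g. sm_60) this typically
// runs no faster than the FP32 original.
__global__ void axpy_half(int n, __half a, const __half *x, __half *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma(a, x[i], y[i]);
}

// Vectorized port: __half2 packs two FP16 values into one 32-bit
// register, and __hfma2 performs both fused multiply-adds in a single
// instruction. This pairing is what unlocks Pascal's 2x FP16 peak
// relative to FP32.
__global__ void axpy_half2(int n2, __half2 a2, const __half2 *x, __half2 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a2, x[i], y[i]);
}

On the host, the scalar coefficient would be broadcast with __half2half2(__float2half(alpha)) and the arrays reinterpreted as __half2* over n/2 elements (assuming even n). Mechanizing this packing, along with the precision analysis that decides which values can safely be demoted, is the sort of transformation an automated FP32-to-FP16 migration tool has to handle.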