基于经验优化的GPU基数排序方法

Bonan Huang, Jinlan Gao, Xiaoming Li
{"title":"基于经验优化的GPU基数排序方法","authors":"Bonan Huang, Jinlan Gao, Xiaoming Li","doi":"10.1109/ISPA.2009.89","DOIUrl":null,"url":null,"abstract":"Graphics Processing Units (GPUs) that support general purpose program are promising platforms for high performance computing. However, the fundamental architectural difference between GPU and CPU, the complexity of GPU platform and the diversity of GPU specifications have made the generation of highly efficient code for GPU increasingly difficult. Manual code generation is time consuming and the result tends to be difficult to debug and maintain. On the other hand, the code generated by today's GPU compiler often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS~\\cite{Whaley}, FFTW~\\cite{FFTW_org}, SPIRAL~\\cite{Pueschel:05} and X-Sort~\\cite{Li:05}, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proved successful when applied to CPU where the performance of CPU programs has been relatively better understood. Clearly, empirical search must be extended to general purpose programs on GPU. In this paper, we propose an empirical optimization technique for one of the most important sorting routines on GPU, the radix sort, that generates highly efficient code for a number of representative NVIDIA GPUs with a wide variety of architectural specifications. Our study has been focused on the algorithmic parameters of radix sort that can be adapted to different environments and the GPU architectural factors that affect the performance of radix sort. We present a powerful empirical optimization approach that is shown to be able to find highly efficient code for different NVIDIA GPUs. Our results show that such an empirical optimization approach is quite effective at taking into account the complex interactions between architectural characteristics and that the resulting code performs significantly better than two radix sort implementations that have been shown outperforming other GPU sort routines with the maximal speedup of 33.4\\%.","PeriodicalId":346815,"journal":{"name":"2009 IEEE International Symposium on Parallel and Distributed Processing with Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"An Empirically Optimized Radix Sort for GPU\",\"authors\":\"Bonan Huang, Jinlan Gao, Xiaoming Li\",\"doi\":\"10.1109/ISPA.2009.89\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graphics Processing Units (GPUs) that support general purpose program are promising platforms for high performance computing. However, the fundamental architectural difference between GPU and CPU, the complexity of GPU platform and the diversity of GPU specifications have made the generation of highly efficient code for GPU increasingly difficult. Manual code generation is time consuming and the result tends to be difficult to debug and maintain. On the other hand, the code generated by today's GPU compiler often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS~\\\\cite{Whaley}, FFTW~\\\\cite{FFTW_org}, SPIRAL~\\\\cite{Pueschel:05} and X-Sort~\\\\cite{Li:05}, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proved successful when applied to CPU where the performance of CPU programs has been relatively better understood. Clearly, empirical search must be extended to general purpose programs on GPU. In this paper, we propose an empirical optimization technique for one of the most important sorting routines on GPU, the radix sort, that generates highly efficient code for a number of representative NVIDIA GPUs with a wide variety of architectural specifications. Our study has been focused on the algorithmic parameters of radix sort that can be adapted to different environments and the GPU architectural factors that affect the performance of radix sort. We present a powerful empirical optimization approach that is shown to be able to find highly efficient code for different NVIDIA GPUs. Our results show that such an empirical optimization approach is quite effective at taking into account the complex interactions between architectural characteristics and that the resulting code performs significantly better than two radix sort implementations that have been shown outperforming other GPU sort routines with the maximal speedup of 33.4\\\\%.\",\"PeriodicalId\":346815,\"journal\":{\"name\":\"2009 IEEE International Symposium on Parallel and Distributed Processing with Applications\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-08-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2009 IEEE International Symposium on Parallel and Distributed Processing with Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISPA.2009.89\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 IEEE International Symposium on Parallel and Distributed Processing with Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPA.2009.89","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

摘要

支持通用程序的图形处理单元(Graphics Processing unit, gpu)是一种很有前途的高性能计算平台。然而,GPU与CPU在架构上的根本差异、GPU平台的复杂性以及GPU规格的多样性,使得为GPU生成高效的代码变得越来越困难。手动代码生成非常耗时,而且结果往往难以调试和维护。另一方面,由今天的GPU编译器生成的代码通常比最好的手动调优代码的性能低得多。由ATLAS \cite{Whaley}、FFTW \cite{FFTW_org}、SPIRAL \cite{Pueschel:05}和X-Sort \cite{Li:05}等系统实现的一种很有前途的代码生成策略,使用经验搜索来找到实现的参数值,例如块大小和指令时间表,为特定机器提供接近最佳的性能。然而,这种方法只有在应用于CPU时才被证明是成功的,因为CPU程序的性能已经得到了相对更好的理解。显然,经验搜索必须扩展到GPU上的通用程序。在本文中,我们提出了一种经验优化技术,用于GPU上最重要的排序例程之一,基数排序,该技术可为具有各种架构规范的许多具有代表性的NVIDIA GPU生成高效代码。我们的研究主要集中在可以适应不同环境的基数排序算法参数和影响基数排序性能的GPU架构因素。我们提出了一个强大的经验优化方法,该方法被证明能够为不同的NVIDIA gpu找到高效的代码。我们的结果表明,这种经验优化方法在考虑到架构特征之间的复杂交互方面非常有效,并且结果代码的性能明显优于两个基数排序实现,这两个实现的性能已经被证明优于其他GPU排序例程,最大加速提升了33.4%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
An Empirically Optimized Radix Sort for GPU
Graphics Processing Units (GPUs) that support general purpose program are promising platforms for high performance computing. However, the fundamental architectural difference between GPU and CPU, the complexity of GPU platform and the diversity of GPU specifications have made the generation of highly efficient code for GPU increasingly difficult. Manual code generation is time consuming and the result tends to be difficult to debug and maintain. On the other hand, the code generated by today's GPU compiler often has much lower performance than the best hand-tuned codes. A promising code generation strategy, implemented by systems like ATLAS~\cite{Whaley}, FFTW~\cite{FFTW_org}, SPIRAL~\cite{Pueschel:05} and X-Sort~\cite{Li:05}, uses empirical search to find the parameter values of the implementation, such as the tile size and instruction schedules, that deliver near-optimal performance for a particular machine. However, this approach has only proved successful when applied to CPU where the performance of CPU programs has been relatively better understood. Clearly, empirical search must be extended to general purpose programs on GPU. In this paper, we propose an empirical optimization technique for one of the most important sorting routines on GPU, the radix sort, that generates highly efficient code for a number of representative NVIDIA GPUs with a wide variety of architectural specifications. Our study has been focused on the algorithmic parameters of radix sort that can be adapted to different environments and the GPU architectural factors that affect the performance of radix sort. We present a powerful empirical optimization approach that is shown to be able to find highly efficient code for different NVIDIA GPUs. Our results show that such an empirical optimization approach is quite effective at taking into account the complex interactions between architectural characteristics and that the resulting code performs significantly better than two radix sort implementations that have been shown outperforming other GPU sort routines with the maximal speedup of 33.4\%.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Completion Time Estimation for Instances of Generalized Well-Formed Workflow A Synchronization-Based Alternative to Directory Protocol Web Service Locating Unit in RFID-Centric Anti-counterfeit System Distributed Transfer Network Learning Based Intrusion Detection Multi-Source Traffic Data Fusion Method Based on Regulation and Reliability
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1