Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs

Kyle Berney, Nodari Sitchinava
{"title":"Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs","authors":"Kyle Berney, Nodari Sitchinava","doi":"10.1109/IPDPS47924.2020.00119","DOIUrl":null,"url":null,"abstract":"Currently, the fastest comparison-based sorting implementation on GPUs is implemented using a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads t to sort the input of N elements is fine-tuned experimentally for each generation of Nvidia GPUs in such a way that the number of elements E = N/t that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency-hiding through thread oversubscription/occupancy.In this paper, we show that for every choice of E < w, such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads due to sequentialization of shared memory accesses due to bank conflicts. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations.Our proof is constructive, i.e., we are able to automatically construct such permutation for every value of E. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"1133-1142"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS47924.2020.00119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Currently, the fastest comparison-based sorting implementation on GPUs is implemented using a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads t to sort the input of N elements is fine-tuned experimentally for each generation of Nvidia GPUs in such a way that the number of elements E = N/t that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency-hiding through thread oversubscription/occupancy.In this paper, we show that for every choice of E < w, such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads due to sequentialization of shared memory accesses due to bank conflicts. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations.Our proof is constructive, i.e., we are able to automatically construct such permutation for every value of E. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
gpu上成对归并排序的工程最坏情况输入
目前,gpu上最快的基于比较的排序实现是使用并行成对归并排序算法(Thrust库)实现的。为了实现快速运行,对N个元素的输入进行排序的线程数t对每一代Nvidia gpu进行了实验微调,使得每个线程在每个合并轮中访问的元素数E = N/t导致少量(经验测量的)共享内存争用,称为银行冲突,同时通过线程超额订阅/占用来平衡全局内存访问和延迟隐藏的数量。在本文中,我们证明了对于E < w的每一个选择,使得E和w是共素数,存在一个输入置换,在这个输入置换上,由于银行冲突导致共享内存访问的顺序化,使得推力归并排序的w个线程的每一个warp都有效地减少到至多使用(w/E)个线程。请注意,这与由于任何warp访问wE连续共享内存位置的内存争用而导致的并行性损失的平凡最坏情况边界相匹配。我们的证明是建设性的,也就是说,我们能够为每个e的值自动构造这样的排列。我们还在实践中表明,与现代GPU硬件上随机输入的性能相比,这种构造的输入导致高达50%的减速。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Asynch-SGBDT: Train Stochastic Gradient Boosting Decision Trees in an Asynchronous Parallel Manner Resilience at Extreme Scale and Connections with Other Domains A Tale of Two C's: Convergence and Composability 12 Ways to Fool the Masses with Irreproducible Results Is Asymptotic Cost Analysis Useful in Developing Practical Parallel Algorithms
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1