{"title":"Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs","authors":"Kyle Berney, Nodari Sitchinava","doi":"10.1109/IPDPS47924.2020.00119","DOIUrl":null,"url":null,"abstract":"Currently, the fastest comparison-based sorting implementation on GPUs is implemented using a parallel pairwise merge sort algorithm (Thrust library). To achieve fast runtimes, the number of threads t to sort the input of N elements is fine-tuned experimentally for each generation of Nvidia GPUs in such a way that the number of elements E = N/t that each thread accesses in each merging round results in a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency-hiding through thread oversubscription/occupancy.In this paper, we show that for every choice of E < w, such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads due to sequentialization of shared memory accesses due to bank conflicts. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations.Our proof is constructive, i.e., we are able to automatically construct such permutation for every value of E. We also show in practice that such constructed inputs result in up to ~50% slowdown, compared to the performance on random inputs, on modern GPU hardware.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"1133-1142"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS47924.2020.00119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Currently, the fastest comparison-based sorting implementation on GPUs is a parallel pairwise merge sort (the Thrust library). To achieve fast runtimes, the number of threads t used to sort an input of N elements is tuned experimentally for each generation of Nvidia GPUs, so that the number of elements E = N/t that each thread accesses in each merging round incurs a small (empirically measured) number of shared memory contentions, known as bank conflicts, while balancing the number of global memory accesses and latency hiding through thread oversubscription/occupancy.

In this paper, we show that for every choice of E < w such that E and w are co-prime, there exists an input permutation on which every warp of w threads of the Thrust merge sort is effectively reduced to using at most ⌈w/E⌉ threads, due to sequentialization of shared memory accesses caused by bank conflicts. Note that this matches the trivial worst-case bound on the loss of parallelism due to memory contentions for any warp accessing wE contiguous shared memory locations.

Our proof is constructive, i.e., we can automatically construct such a permutation for every value of E. We also show that, in practice, these constructed inputs result in up to ~50% slowdown on modern GPU hardware compared to the performance on random inputs.
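As a rough illustration of the bank-conflict mechanism described above, the following host-side C++ sketch counts per-step bank conflicts for one warp under the standard NVIDIA model of 32 shared-memory banks with consecutive 32-bit words mapped to consecutive banks. The "adversarial" starting offsets are hypothetical and chosen only to demonstrate the ⌈w/E⌉ effect; they are not the permutation construction from the paper.

```cpp
// A minimal, self-contained host-side sketch (not code from the paper): it
// estimates shared-memory serialization for one warp in which each of w = 32
// threads reads E contiguous 4-byte words during a merge round. It assumes
// the standard NVIDIA model of 32 banks. The "adversarial" offsets below are
// hypothetical and only illustrate the ceil(w/E) effect; they are not the
// paper's adversarial permutation.
#include <algorithm>
#include <array>
#include <cstdio>

constexpr int kWarp  = 32;  // threads per warp (w)
constexpr int kBanks = 32;  // shared-memory banks on modern NVIDIA GPUs

// Conflict degree of one access step: the largest number of threads whose
// addresses fall into the same bank. The hardware replays the step that many
// times, so only about kWarp / degree threads make progress per replay.
int conflict_degree(const std::array<int, kWarp>& word_addr) {
    std::array<int, kBanks> per_bank{};
    for (int a : word_addr) per_bank[a % kBanks]++;
    return *std::max_element(per_bank.begin(), per_bank.end());
}

// Worst per-step conflict degree over the E steps in which thread t reads the
// contiguous words start[t], start[t]+1, ..., start[t]+E-1.
int worst_degree(const std::array<int, kWarp>& start, int E) {
    int worst = 1;
    for (int k = 0; k < E; ++k) {
        std::array<int, kWarp> addr{};
        for (int t = 0; t < kWarp; ++t) addr[t] = start[t] + k;
        worst = std::max(worst, conflict_degree(addr));
    }
    return worst;
}

int main() {
    const int E = 3;  // elements per thread per merge round, co-prime with 32

    // Benign layout: thread t starts at t*E, a perfect tiling of w*E
    // contiguous words. Because gcd(E, 32) = 1, every step is conflict-free.
    std::array<int, kWarp> benign{};
    for (int t = 0; t < kWarp; ++t) benign[t] = t * E;

    // Hypothetical adversarial layout: threads are grouped in blocks of E and
    // all threads in a group read the same E banks, so every step serializes
    // E-fold and the warp behaves like roughly ceil(w/E) threads.
    std::array<int, kWarp> adversarial{};
    for (int t = 0; t < kWarp; ++t) adversarial[t] = (t / E) * E + (t % E) * kBanks;

    int d_benign = worst_degree(benign, E);
    int d_adv    = worst_degree(adversarial, E);
    std::printf("benign layout:      worst conflict degree = %d\n", d_benign);
    std::printf("adversarial layout: worst conflict degree = %d "
                "(warp acts like ~%d of %d threads)\n",
                d_adv, (kWarp + d_adv - 1) / d_adv, kWarp);
    return 0;
}
```

With E = 3, the benign tiling is conflict-free (since gcd(E, 32) = 1, the 32 simultaneous addresses in each step hit 32 distinct banks), while the grouped layout serializes every step E-fold, so the warp makes progress as if only about ⌈w/E⌉ = 11 threads were active.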