Engineering In-place (Shared-memory) Sorting Algorithms

Pub Date: 2020-09-28 DOI: 10.1145/3505286

Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders


Citations: 16

Abstract

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage comes from the fact that the algorithms work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach, taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm, In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the, in some sense, "best" sorting algorithm can be found in many papers, and these claims cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms, which gives one reason to prefer comparison-based algorithms for robust general-purpose sorting.
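
The "branchless decision trees" mentioned in the abstract can be illustrated with a small sketch. It is not taken from the authors' implementation. It only shows the general idea used in super scalar samplesort: the sorted splitters are stored as an implicit binary search tree, and each comparison result is turned into an index increment instead of a data-dependent branch. The names `build_tree` and `classify`, the `int` key type, and the restriction to a power-of-two bucket count are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: lays out the sorted splitters in [lo, hi) as an
// implicit binary search tree in 1-based heap order (children of node i
// are 2*i and 2*i + 1). Assumes splitters.size() + 1 is a power of two
// and tree.size() >= splitters.size() + 1.
void build_tree(const std::vector<int>& splitters, std::vector<int>& tree,
                std::size_t node, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;
    const std::size_t mid = lo + (hi - lo) / 2;
    tree[node] = splitters[mid];
    build_tree(splitters, tree, 2 * node, lo, mid);
    build_tree(splitters, tree, 2 * node + 1, mid + 1, hi);
}

// Hypothetical helper: returns the bucket index of `key` among
// `num_buckets` buckets (a power of two). The comparison result is used
// as an integer (0 or 1), so the descent has no data-dependent branch in
// the source; the loop itself runs a fixed log2(num_buckets) iterations.
std::size_t classify(const std::vector<int>& tree, std::size_t num_buckets,
                     int key) {
    std::size_t i = 1;
    while (i < num_buckets) {
        i = 2 * i + static_cast<std::size_t>(tree[i] < key);
    }
    return i - num_buckets;
}
```

With splitters {10, 20, 30} and four buckets, a call such as `build_tree(splitters, tree, 1, 0, splitters.size())` followed by `classify(tree, 4, key)` places keys up to 10 in bucket 0, keys in (10, 20] in bucket 1, keys in (20, 30] in bucket 2, and larger keys in bucket 3. Whether a compiler actually emits branch-free machine code for the comparison depends on the compiler and target, so this should be read as a sketch of the technique rather than a performance claim.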
