Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

Proceedings of the 2018 International Conference on Supercomputing. Pub Date: 2018-06-12. DOI: 10.1145/3205289.3205298

Ben Karsin, Volker Weichert, H. Casanova, J. Iacono, Nodari Sitchinava


Citations: 11

Abstract



We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters with asymptotic analysis of the number of memory accesses by an algorithm. Using this formula we analyze and compare several GPU sorting algorithms, identifying key performance bottlenecks in each one of them. Based on this analysis we propose a GPU-efficient multiway merge-sort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We realize an implementation of GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.
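To make the notion of a bank conflict concrete, the following is a minimal sketch and not the formula or code from the paper. It assumes a shared memory with 32 banks, each one 4-byte word wide, and a warp of 32 threads; the function names `conflict_degree` and `strided_warp_access` are hypothetical helpers introduced only for this illustration. It counts how many distinct words a single warp access maps to the same bank, which is the factor by which that access would be serialized under these assumptions.

```python
from collections import defaultdict

# Assumption for illustration: 32 banks, each one 4-byte word wide.
NUM_BANKS = 32


def conflict_degree(word_indices, num_banks=NUM_BANKS):
    """Return the largest number of distinct words that map to one bank.

    A result of 1 means the access is conflict-free; a result of k means
    the access is serialized into k steps under the assumed bank model.
    Threads reading the same word are counted once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % num_banks].add(word)
    return max(len(words) for words in words_per_bank.values())


def strided_warp_access(stride, warp_size=32):
    """Word indices touched by a warp in which lane i reads word i * stride."""
    return [lane * stride for lane in range(warp_size)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        degree = conflict_degree(strided_warp_access(stride))
        print(f"stride {stride:2d}: conflict degree {degree}")
```

Under this model a stride of 32 words sends every lane to the same bank, while a stride of 33 spreads the lanes over all banks again, which is why padding a shared-memory array by one word per row is a common way to avoid conflicts.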

