RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks

arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2024-09-01. DOI: arxiv-2409.00822

Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding


Citations: 0

Abstract




Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a binary search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speed increases ranging from 4.245× to 9.506× with early stopping, and 3.936× without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of Graph Neural Networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
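The abstract does not spell out the kernel, so the following is only a minimal sequential sketch of what a binary search-based row-wise top-k with early stopping might look like. It is not the authors' GPU implementation. The function names (`row_topk`, `rowwise_topk`), the `max_iter` parameter, and the rule for filling the remaining slots after an early stop are all assumptions made for illustration. The idea is to search for a threshold value such that about k elements of the row lie at or above it, instead of sorting the row.

```python
def row_topk(row, k, max_iter=None):
    """Return indices of k large elements of `row` (hypothetical helper).

    A threshold is binary-searched between min(row) and max(row).
    `max_iter` caps the number of search steps (early stopping, an
    assumed interface); None means search until the interval cannot shrink.
    """
    if not 0 < k <= len(row):
        raise ValueError("k must satisfy 0 < k <= len(row)")
    lo, hi = min(row), max(row)  # invariant: at least k elements are >= lo
    steps = 0
    while max_iter is None or steps < max_iter:
        mid = (lo + hi) / 2
        if mid <= lo or mid >= hi:  # floating-point resolution exhausted
            break
        count = sum(1 for x in row if x >= mid)
        if count >= k:
            lo = mid
            if count == k:  # threshold separates exactly k elements
                break
        else:
            hi = mid
        steps += 1
    # Elements at or above `hi` are taken first: after `hi` has moved,
    # fewer than k elements reach it, so all of them are true top-k members.
    picked = [i for i, x in enumerate(row) if x >= hi][:k]
    if len(picked) < k:
        # Fill the remaining slots from the undecided band [lo, hi).
        # After an early stop this choice is approximate (assumed rule).
        band = [i for i, x in enumerate(row) if lo <= x < hi]
        picked += band[:k - len(picked)]
    return picked


def rowwise_topk(matrix, k, max_iter=None):
    """Apply row_topk to every row independently (rows could run in parallel)."""
    return [row_topk(row, k, max_iter) for row in matrix]


if __name__ == "__main__":
    features = [[0.1, 0.9, 0.4, 0.7, 0.3],
                [5.0, 1.0, 4.0, 2.0, 3.0]]
    print(rowwise_topk(features, 2))               # search to convergence
    print(rowwise_topk(features, 2, max_iter=1))   # early-stopped variant
```

In this sketch each row is independent, which is the property a GPU implementation would exploit by assigning rows to separate groups of threads. Without an iteration cap the selection matches a sort-based top-k up to floating-point resolution; with a small `max_iter` the returned elements are all at or above the current lower threshold but may not be the exact top k.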
