RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks

Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
arXiv - CS - Distributed, Parallel, and Cluster Computing, published 2024-09-01
DOI: arxiv-2409.00822 (https://doi.org/arxiv-2409.00822)
Citations: 0

Abstract

Top-k algorithms are essential in many applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a binary search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speedups ranging from 4.245× to 9.506× with early stopping, and 3.936× without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of Graph Neural Networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
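The abstract names the technique (binary search plus early stopping) but does not spell out the procedure. The following is a minimal CPU sketch in pure Python of one way such a selection could work, not the authors' GPU kernel. It assumes the search runs over a value threshold per row, and that on early stopping the elements at or above the upper bound are taken first and the remainder is filled from the borderline interval. The names `rowwise_topk` and `max_iter` are hypothetical and chosen only for illustration.

```python
def rowwise_topk(matrix, k, max_iter=64):
    """Select k indices per row by binary-searching a value threshold.

    Illustrative sketch only. max_iter caps the number of search steps;
    a small value acts as early stopping and may return an approximate
    top-k (every returned element is still >= the lower bound reached).
    """
    selected = []
    for row in matrix:
        if not 0 < k <= len(row):
            raise ValueError("k must satisfy 0 < k <= row length")
        # Invariant: at least k elements are >= lo.
        # Once hi has been lowered, fewer than k elements are >= hi.
        lo, hi = min(row), max(row)
        for _ in range(max_iter):
            mid = (lo + hi) / 2
            count = sum(1 for x in row if x >= mid)
            if count == k:      # threshold separates exactly k elements
                lo = mid
                break
            if count > k:       # threshold too low, raise the lower bound
                lo = mid
            else:               # threshold too high, lower the upper bound
                hi = mid
        # Elements >= hi are taken first (truncated to k in case of ties
        # at the row maximum); the rest come from the interval [lo, hi).
        sure = [i for i, x in enumerate(row) if x >= hi][:k]
        border = [i for i, x in enumerate(row) if lo <= x < hi]
        selected.append(sure + border[:k - len(sure)])
    return selected


rows = [[1.0, 9.0, 5.0, 7.0],
        [3.0, 1.0, 2.0, 4.0]]
print(rowwise_topk(rows, 2))              # [[1, 3], [3, 0]]
print(rowwise_topk(rows, 2, max_iter=1))  # [[1, 2], [3, 0]]
```

In the second call the search stops after one step, so the first row returns the element 5.0 in place of 7.0. This is the kind of approximation the abstract reports as having minimal impact on test accuracy. Each search step only needs a count per row, which is presumably what makes the approach attractive for parallel hardware, though the sketch makes no claim about how the paper maps it to GPU threads.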