Dynamic Buffer Management in Massively Parallel Systems: The Power of Randomness

ACM Transactions on Parallel Computing (IF 1.2; JCR Q3, Computer Science, Theory & Methods). Pub Date: 2025-03-01. Epub Date: 2025-02-11. DOI: 10.1145/3701623

Minh Pham, Yongke Yuan, Hao Li, Chengcheng Mou, Yicheng Tu, Zichen Xu, Jinghan Meng

Volume 12, Issue 1. PDF (PubMed Central): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11841858/pdf/


Abstract

Massively parallel systems, such as Graphics Processing Units (GPUs), play an increasingly crucial role in today's data-intensive computing. The unique challenges associated with developing system software for massively parallel hardware to support numerous parallel threads efficiently are of paramount importance. One such challenge is the design of a dynamic memory allocator to allocate memory at runtime. Traditionally, memory allocators have relied on maintaining a global data structure, such as a queue of free pages. However, in the context of massively parallel systems, accessing such global data structures can quickly become a bottleneck even with multiple queues in place. This paper presents a novel approach to dynamic memory allocation that eliminates the need for a centralized data structure. Our proposed approach revolves around letting threads employ random search procedures to locate free pages. Through mathematical proofs and extensive experiments, we demonstrate that the basic random search design achieves lower latency than the best-known existing solution in most situations. Furthermore, we develop more advanced techniques and algorithms to tackle the challenge of warp divergence and further enhance performance when free memory is limited. Building upon these advancements, our mathematical proofs and experimental results affirm that these advanced designs can yield an order of magnitude improvement over the basic design and consistently outperform the state-of-the-art by up to two orders of magnitude. To illustrate the practical implications of our work, we integrate our memory management techniques into two GPU algorithms: a hash join and a group-by. Both case studies provide compelling evidence of our approach's pronounced performance gains.
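The random-search idea in the abstract can be illustrated with a small model. The sketch below is not the authors' GPU implementation: it is a single-threaded Python toy, and the names `RandomSearchAllocator` and `mean_probes` are hypothetical. It assumes each probe picks a page uniformly at random and ignores contention between threads. Under those assumptions, with A free pages out of T, the number of probes per allocation is geometrically distributed with mean T/A, which hints at why latency grows when free memory is scarce.

```python
import random


class RandomSearchAllocator:
    """Toy, single-threaded model of page allocation by random probing."""

    def __init__(self, total_pages, seed=0):
        self.used = [False] * total_pages  # True means the page is taken
        self.free_count = total_pages
        self.rng = random.Random(seed)

    def alloc(self):
        """Probe random pages until a free one is claimed.

        Returns (page_index, number_of_probes).
        """
        if self.free_count == 0:
            raise MemoryError("no free pages")
        probes = 0
        while True:
            page = self.rng.randrange(len(self.used))
            probes += 1
            if not self.used[page]:
                # Assumption: on parallel hardware this check-and-set would
                # have to be a single atomic operation.
                self.used[page] = True
                self.free_count -= 1
                return page, probes

    def free(self, page):
        """Return a page to the pool; no global free list is maintained."""
        if not self.used[page]:
            raise ValueError("page is not allocated")
        self.used[page] = False
        self.free_count += 1


def mean_probes(total_pages, occupied_fraction, trials, seed=0):
    """Average probes per allocation at a fixed occupancy level."""
    allocator = RandomSearchAllocator(total_pages, seed)
    occupied = int(total_pages * occupied_fraction)
    for page in allocator.rng.sample(range(total_pages), occupied):
        allocator.used[page] = True
        allocator.free_count -= 1
    total = 0
    for _ in range(trials):
        page, probes = allocator.alloc()
        total += probes
        allocator.free(page)  # free immediately so occupancy stays constant
    return total / trials


if __name__ == "__main__":
    for fraction in (0.5, 0.9):
        print(fraction, mean_probes(1000, fraction, 2000))
```

In this model the average should sit near 2 probes at 50% occupancy and near 10 probes at 90% occupancy, matching T/A. The model does not capture warp divergence, where threads in the same warp finish their searches after different numbers of probes.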
