
Latest Publications: IEEE Transactions on Parallel and Distributed Systems

G-Learned Index: Enabling Efficient Learned Index on GPU
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-04-02 · DOI: 10.1109/TPDS.2024.3381214
Jiesong Liu;Feng Zhang;Lv Lu;Chang Qi;Xiaoguang Guo;Dong Deng;Guoliang Li;Huanchen Zhang;Jidong Zhai;Hechen Zhang;Yuxing Chen;Anqun Pan;Xiaoyong Du
AI and GPU technologies have been widely applied to solve Big Data problems. The total data volume worldwide reached 200 zettabytes in 2022, making it a serious challenge to efficiently index the required content among massive data. Recently, a promising learned index has been proposed to address this challenge: it achieves extremely high efficiency while incurring only marginal space overhead. However, we notice that previous learned indexes have mainly focused on CPU architectures, ignoring the advantages of GPUs. Because traditional indexes such as the B-Tree, LSM-tree, and bitmap have greatly benefited from GPU acceleration, combining a learned index with the GPU has great potential to deliver tremendous speedups. In this paper, we propose a GPU-based learned index, called G-Learned Index, to significantly improve the performance of learned index structures. The primary challenges in developing G-Learned Index lie in using thousands of GPU cores (minimizing synchronization and branch divergence), designing data structures for parallel operations, and exploiting memory bandwidth (limited memory transactions and a multi-level memory hierarchy). To overcome these challenges, a series of novel techniques are developed, including efficient thread organization, succinct data structures, and heterogeneous memory hierarchy utilization. Compared to the state-of-the-art learned index, the proposed G-Learned Index achieves an average 174× speedup (and 107× over its parallel version). Meanwhile, we attain 2× lower query time than the state-of-the-art GPU B-Tree. Our further exploration of range queries shows that G-Learned Index is 17× faster than a CPU multi-dimensional learned index.
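For orientation, the sketch below shows the core learned-index idea this abstract builds on: a model predicts a key's position in a sorted array, and a bounded local search corrects the prediction. It is a minimal, CPU-only illustration; the class name, the simple linear model, and the error-bound handling are assumptions for exposition, not the paper's GPU-parallel design.

```python
# Minimal learned-index lookup: predict position with a linear model, then do a
# bounded search within the model's worst-case error. Illustrative only.
import bisect

class LinearLearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~ slope * key + intercept by least squares.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var_k
        self.intercept = mean_p - self.slope * mean_k
        # Record the worst-case prediction error to bound the local search.
        self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        return min(max(int(self.slope * key + self.intercept), 0), len(self.keys) - 1)

    def lookup(self, key):
        pos = self._predict(key)
        lo, hi = max(0, pos - self.err), min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

idx = LinearLearnedIndex(range(0, 1000, 3))
assert idx.lookup(999) is not None and idx.lookup(1000) is None
```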
Citations: 0
The Static Allocation is Not a Static: Optimizing SSD Address Allocation Through Boosting Static Policy
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-30 · DOI: 10.1109/TPDS.2024.3407367
Yang Zhou;Fang Wang;Zhan Shi;Dan Feng
The address allocation policy in an SSD translates the logical addresses of I/O requests into physical addresses, and static address allocation is widely used in modern SSDs. Through extensive experiments, we find that there are significant differences in how well different static address allocation policies exploit SSD parallelism. We also observe that a fixed address allocation design prevents SSDs from continuing to meet the challenges posed by cloud workloads and misses the possibility of further optimization. These situations stem from our excessive reliance on SSD parallelism over time. In this paper, we propose HsaP, a hybrid static address allocation policy that adaptively chooses the best static allocation policy to meet SSD performance requirements at runtime. HsaP is a dynamic scheduling scheme built on static address allocation policies. The static policies ensure that HsaP has stable performance and lightweight overhead, while dynamic scheduling can effectively combine different allocation policies, selecting the best-performing static mapping mode for a given SSD state. Meanwhile, HsaP can further improve the read and write performance of SSDs simultaneously through plane reallocation and data rewriting. Experimental results show that HsaP achieves significant read and write performance gains across a wide range of recent cloud block storage traces compared to several state-of-the-art address allocation approaches.
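As context for the abstract, the sketch below shows one static address allocation of the kind HsaP chooses among: a logical page number is striped across the SSD's parallel units by fixed modulo arithmetic. The geometry and the channel-first striping order are illustrative assumptions; HsaP's contribution is selecting among such static policies dynamically at runtime.

```python
# One static allocation policy: stripe logical pages channel-first. Illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Geometry:
    channels: int = 8
    chips_per_channel: int = 4
    dies_per_chip: int = 2
    planes_per_die: int = 2

def static_allocate(lpn: int, g: Geometry):
    """Map a logical page number to (channel, chip, die, plane) by striping."""
    channel = lpn % g.channels
    rest = lpn // g.channels
    chip = rest % g.chips_per_channel
    rest //= g.chips_per_channel
    die = rest % g.dies_per_chip
    plane = (rest // g.dies_per_chip) % g.planes_per_die
    return channel, chip, die, plane

# Consecutive logical pages land on different channels, exposing parallelism.
print([static_allocate(lpn, Geometry())[0] for lpn in range(10)])
```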
Citations: 0
Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities
IF 5.6 · CAS Region 2 (Computer Science) · Q1 Computer Science, Theory & Methods · Pub Date: 2024-03-29 · DOI: 10.1109/TPDS.2024.3406764
Ahmad Tarraf;Martin Schreiber;Alberto Cascajo;Jean-Baptiste Besnard;Marc-André Vef;Dominik Huber;Sonja Happ;André Brinkmann;David E. Singh;Hans-Christian Hoppe;Alberto Miranda;Antonio J. Peña;Rui Machado;Marta Garcia-Gasulla;Martin Schulz;Paul Carpenter;Simon Pickartz;Tiberiu Rotaru;Sergio Iserte;Victor Lopez;Jorge Ejarque;Heena Sirwani;Jesus Carretero;Felix Wolf
With the increase in complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially given trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels by adapting applications' configurations to the system's resources. In this context, malleable jobs, which can change their resources at runtime, can increase system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting times). Malleability has received much attention recently, even though it has been an active research area for more than two decades. This article presents the state of the art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.
Citations: 0
Proactive Caching With Distributed Deep Reinforcement Learning in 6G Cloud-Edge Collaboration Computing
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-28 · DOI: 10.1109/TPDS.2024.3406027
Changmao Wu;Zhengwei Xu;Xiaoming He;Qi Lou;Yuanyuan Xia;Shuman Huang
Proactive caching in 6G cloud-edge collaboration scenarios, which intelligently and periodically updates the cached contents, can either alleviate traffic congestion on the backhaul and edge cooperative links or bring multimedia services to mobile users. To further improve the network performance of the 6G cloud-edge, we consider a multi-objective joint optimization problem, i.e., maximizing the edge hit ratio while minimizing content access latency and traffic cost. To solve this complex problem, we focus on a distributed deep reinforcement learning (DRL)-based method for proactive caching, comprising content prediction and content decision-making. Specifically, since prior information about user requests is seldom available in practice for the current time period, a novel method named the temporal convolution sequence network (TCSN), based on the temporal convolution network (TCN) and an attention model, is used to improve the accuracy of content prediction. Furthermore, based on the content prediction, a distributional deep Q network (DDQN) builds a distribution model over returns to optimize the content decision-making policy. The generative adversarial network (GAN) is adapted in a distributed fashion, emphasizing learning the data distribution and generating compelling data across multiple nodes. In addition, prioritized experience replay (PER) helps learn from the most effective samples. We therefore propose a multivariate fusion algorithm called PG-DDQN. Finally, faced with such a complex scenario, a distributed learning architecture, i.e., a multi-agent learning architecture, is used to learn the DRL-based methods with centralized training and distributed inference. The experiments show that our proposal achieves satisfactory performance in terms of edge hit ratio, traffic cost, and content access latency.
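To make the caching-as-RL framing concrete, here is a toy tabular sketch of learning a cache-admission policy from hit/miss rewards. It is only in the spirit of the abstract: the paper uses a distributional DDQN driven by TCSN popularity predictions, whereas the state, reward shaping, and eviction rule below are simplifying assumptions for illustration.

```python
# Toy RL-driven cache admission: tabular Q-learning over (content, action) pairs.
import random
from collections import defaultdict

CACHE_SIZE, ALPHA, GAMMA, EPS = 3, 0.1, 0.9, 0.1
q = defaultdict(float)          # Q[(content, action)] with action 1 = cache it
cache, rng = set(), random.Random(0)

def step(content, hit):
    """Choose whether to admit `content`, then update Q from the observed reward."""
    action = rng.choice([0, 1]) if rng.random() < EPS else int(q[(content, 1)] >= q[(content, 0)])
    reward = 1.0 if hit else (-0.2 if action else 0.0)   # hit bonus vs. caching cost
    best_next = max(q[(content, 0)], q[(content, 1)])
    q[(content, action)] += ALPHA * (reward + GAMMA * best_next - q[(content, action)])
    if action and content not in cache:
        if len(cache) >= CACHE_SIZE:                      # evict the least-valued item
            cache.discard(min(cache, key=lambda c: q[(c, 1)]))
        cache.add(content)

# Skewed request stream: content 0 is requested far more often than the rest.
for _ in range(5000):
    c = 0 if rng.random() < 0.5 else rng.randrange(1, 20)
    step(c, hit=c in cache)
print("popular content cached:", 0 in cache)
```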
Citations: 0
A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-28 · DOI: 10.1109/TPDS.2024.3406420
Shengwei Li;Kai Lu;Zhiquan Lai;Weijie Liu;Keshi Ge;Dongsheng Li
Transformer-based deep neural network (DNN) models have shown considerable success across diverse tasks, prompting widespread adoption of distributed training methods such as data parallelism and pipeline parallelism. As parameter counts grow, hybrid parallel training becomes imperative for scaling. The primary bottleneck in scaling remains the communication overhead. Communication scheduling, which emphasizes overlapping communication with computation, has demonstrated its benefits for scaling. However, most existing works focus on data parallelism, overlooking the nuances of hybrid parallel training. In this paper, we propose TriRace, an efficient communication scheduling framework for accelerating communications in hybrid parallel training that combines asynchronous pipeline parallelism and data parallelism. To achieve effective computation-communication overlap, TriRace introduces 3D communication scheduling, which adeptly leverages data dependencies between communication and computation, efficiently scheduling AllReduce communication, sparse communication, and peer-to-peer communication in hybrid parallel training. To avoid possible communication contention, TriRace also incorporates a topology-aware runtime that optimizes the execution of communication operations by considering ongoing communication operations and real-time network status. We have implemented a prototype of TriRace based on PyTorch and Pipedream-2BW, and conducted comprehensive evaluations against three representative baselines. Experimental results show that TriRace achieves up to 1.07–1.45× speedup compared to the state-of-the-art pipeline parallelism training baseline Pipedream-2BW, and 1.24–1.81× speedup compared to Megatron.
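The computation-communication overlap that such schedulers exploit can be shown with a framework-free sketch: as soon as a layer's gradient is ready, its (simulated) AllReduce is launched in the background while the backward pass continues. The timings and the thread-pool "network" are stand-ins, and this is not TriRace's 3D scheduling, only the basic overlap principle.

```python
# Schematic overlap of backward computation with gradient communication.
import time
from concurrent.futures import ThreadPoolExecutor

def backward_layer(layer):          # stand-in for computing one layer's gradients
    time.sleep(0.05)
    return f"grad[{layer}]"

def all_reduce(grad):               # stand-in for an asynchronous collective
    time.sleep(0.05)
    return f"reduced {grad}"

layers = list(reversed(range(8)))   # backward pass visits layers in reverse order
start = time.time()
with ThreadPoolExecutor(max_workers=1) as network:
    pending = []
    for layer in layers:
        grad = backward_layer(layer)                        # compute
        pending.append(network.submit(all_reduce, grad))    # overlap communication
    results = [f.result() for f in pending]                 # wait only at the end
print(f"overlapped pass took {time.time() - start:.2f}s (serial would be ~0.80s)")
```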
Citations: 0
Parallel Computation of Dominance Scores for Multidimensional Datasets on GPUs
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-27 · DOI: 10.1109/TPDS.2024.3382119
Wei-Mei Chen;Hsin-Hung Tsai;Joon Fong Ling
The dominance scoring problem in a multidimensional dataset is to return the number of points dominated by a given point, which is a common metric for evaluating the quality of a data point. Dominance scoring is an elementary operator for variations of the skyline operator, including top-$k$ dominating and $k$-skyband queries. This study proposes query processing for dominance scores that operates primarily on the graphics processing unit (GPU) to fully utilize its massive processing resources and restricted memory space while reducing the transfer overhead between the central processing unit (CPU) and GPU. We introduce a heap-based multidimensional data structure with complete and well-balanced characteristics. Using our preprocessed data, we can construct a complete R-tree with the non-overlapping property, ensuring that the bounding boxes of internal nodes of the same level do not overlap, thereby reducing redundant operations. In addition, we propose two algorithms based on depth-first and breadth-first traversals to accumulate the dominance score on GPUs in parallel. Both take full advantage of the GPU's computing resources and memory space supported by the non-overlapping tree structures. Experiments on synthetic and real-world datasets demonstrate that the proposed algorithms implemented on GPUs dramatically improve the efficiency of dominance scoring.
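For readers unfamiliar with the metric, the sketch below computes dominance scores by brute force: for each query point, count how many data points it dominates. The "smaller is better in every dimension, strictly in at least one" convention is an assumption typical of skyline work, and the O(n·m·d) NumPy broadcast is for clarity only; the paper's contribution is doing this with R-tree pruning in parallel on GPUs.

```python
# Brute-force dominance scores for illustration of the problem definition.
import numpy as np

def dominance_scores(queries: np.ndarray, data: np.ndarray) -> np.ndarray:
    q = queries[:, None, :]                      # shape (m, 1, d)
    leq = (q <= data[None, :, :]).all(axis=2)    # no worse in every dimension
    lt = (q < data[None, :, :]).any(axis=2)      # strictly better in at least one
    return (leq & lt).sum(axis=1)                # points dominated by each query

rng = np.random.default_rng(0)
data = rng.random((1000, 3))
queries = np.array([[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]])
print(dominance_scores(queries, data))           # the "good" point dominates far more
```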
Citations: 0
Age-of-Event Aware: Sampling Period Optimization in a Three-Stage Wireless Cyber-Physical System With Diverse Parallelisms
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-27 · DOI: 10.1109/TPDS.2024.3405790
Yanxi Zhang;Muyu Mei;Dongqi Yan;Xu Zhang;Qinghai Yang;Mingwu Yao
With the emergence of parallel computing systems and distributed time-sensitive applications, it is urgent to provide statistical guarantees for the age of information (AoI) in wireless cyber-physical systems (WCPS) with diverse parallelisms. However, most existing research on AoI has tended to focus on serial transmission, and the AoI performance of multi-stage parallel systems remains unclear. To help address these research gaps, in this work we investigate the age-of-event (AoE) violation probability in a three-stage WCPS with diverse parallelisms such as fork-join and split-merge. We analyze both the transient and steady-state characteristics of the AoE violation probability (AoEVP). Using these characteristics, we transform the AoEVP minimization problem into an equivalent minimization problem. Moreover, we develop a queuing model to capture the queue dynamics under the max-plus theory of the stochastic network calculus (SNC) approach. Based on the max-plus model, we derive a closed-form Chernoff upper bound for the equivalent problem by applying the union bound and the Chernoff inequality. Furthermore, we characterize the service process for the different parallelisms applicable to each stage. By solving the Chernoff upper bound with the service moment-generating functions (MGFs), we obtain heuristic update-period solutions for minimizing the AoEVP of the three-stage WCPS. Simulation results validate our analysis and demonstrate that our heuristic update-period solutions are near optimal for minimizing the AoEVP of a three-stage WCPS with diverse parallelisms.
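For reference, the two standard tools the abstract invokes take the following generic forms; the paper's actual bound is derived for its specific three-stage max-plus service model and is not reproduced here.

```latex
% Union bound over events A_1,...,A_n, and the Chernoff (MGF) bound for a random
% variable X with moment-generating function M_X(\theta) = \mathbb{E}[e^{\theta X}].
\Pr\Big(\bigcup_{i=1}^{n} A_i\Big) \le \sum_{i=1}^{n} \Pr(A_i),
\qquad
\Pr(X \ge x) \le e^{-\theta x}\, M_X(\theta) \quad \text{for any } \theta > 0 .
```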
Citations: 0
HybridChain: Fast, Accurate, and Secure Transaction Processing With Distributed Learning
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-26 · DOI: 10.1109/TPDS.2024.3381593
Amirhossein Taherpour;Xiaodong Wang
In order to fully unlock the transformative power of distributed ledgers and blockchains, it is crucial to develop innovative consensus algorithms that can overcome the obstacles of security, scalability, and interoperability, which currently hinder their widespread adoption. This paper introduces HybridChain, which combines the advantages of a sharded blockchain and a DAG distributed ledger, together with a consensus algorithm that leverages decentralized learning. In our approach, validators exchange perceptions as votes to assess potential conflicts between transactions and the witness set, which represents input transactions in the UTXO model. These perceptions collectively contribute to an intermediate belief regarding the validity of transactions. By integrating their beliefs with those of other validators, localized decisions are made to determine validity. Ultimately, a final consensus is achieved through a majority vote, ensuring precise and efficient validation of transactions. Our proposed approach is compared to the existing DAG-based scheme IOTA and the sharded blockchain Omniledger through extensive simulations. The results show that IOTA has high throughput and low latency but sacrifices accuracy and is vulnerable to orphanage attacks, especially at low transaction rates. Omniledger achieves stable accuracy by increasing the number of shards, but at the cost of higher latency. In contrast, the proposed HybridChain exhibits fast, accurate, and secure transaction processing as well as excellent scalability.
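The final step the abstract describes, settling validity by a majority of validators' local beliefs, can be shown with a toy sketch. The boolean "perceptions" and the double-spend check below are illustrative stand-ins, not HybridChain's learned belief exchange or its actual conflict model.

```python
# Toy majority-vote validity decision over validators' local conflict checks.
from dataclasses import dataclass

@dataclass
class Validator:
    name: str
    def perceive(self, tx: dict, spent: set) -> bool:
        # Local conflict check: a transaction is suspect if any input is already spent.
        return not any(inp in spent for inp in tx["inputs"])

def decide(tx: dict, validators: list, spent_views: dict) -> bool:
    votes = [v.perceive(tx, spent_views[v.name]) for v in validators]
    return sum(votes) > len(votes) / 2        # strict majority accepts the transaction

validators = [Validator(f"v{i}") for i in range(5)]
# One lagging validator has not yet seen that input "a" was spent elsewhere.
spent_views = {"v0": {"a"}, "v1": {"a"}, "v2": {"a"}, "v3": {"a"}, "v4": set()}
tx = {"inputs": ["a"], "outputs": ["b"]}
print(decide(tx, validators, spent_views))    # False: most validators see a conflict
```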
Citations: 0
AtRec: Accelerating Recommendation Model Training on CPUs
IF 5.3 · CAS Region 2 (Computer Science) · Q1 Computer Science · Pub Date: 2024-03-25 · DOI: 10.1109/TPDS.2024.3381186
Siqi Wang;Tianyu Feng;Hailong Yang;Xin You;Bangduo Chen;Tongxuan Liu;Zhongzhi Luan;Depei Qian
The popularity of recommendation models and the enhanced AI processing capability of CPUs have provided massive performance opportunities to deliver satisfactory experiences to a large number of users. Unfortunately, existing recommendation model training methods fail to achieve high efficiency due to unique challenges such as dynamic shapes and high parallelism. To address these limitations, we comprehensively study the distinctive characteristics of recommendation models and discover several unexploited optimization opportunities. To exploit such opportunities, we propose AtRec, a high-performance recommendation model training engine that significantly accelerates the training process on CPUs. Specifically, AtRec presents a comprehensive training approach that employs joint operator-level and graph-level optimizations as well as runtime optimization. At the operator level, AtRec identifies and optimizes the time-consuming operators, which enables further efficient graph-level optimizations. At the graph level, AtRec conducts an in-depth analysis of the inefficiencies in several frequently used subgraphs, enabling further performance improvements by eliminating redundant computations and memory accesses. In addition, to achieve better runtime performance, AtRec identifies inefficiencies prevalent in current scheduling and proposes runtime batching. The experimental results demonstrate that AtRec can significantly outperform state-of-the-art recommendation model training engines. We have open-sourced the implementation and corresponding data of AtRec to boost research in this direction.
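The dynamic-shape challenge and the "runtime batching" remedy mentioned above can be illustrated generically: samples with variable-length feature lists are grouped by similar length and padded within each batch, so each batch runs with a near-uniform shape. The bucketing rule and padding value below are assumptions for illustration, not AtRec's actual scheme.

```python
# Generic runtime batching for variable-length inputs: bucket by length, then pad.
from collections import defaultdict

def runtime_batches(samples, bucket_width=4, max_batch=3):
    buckets = defaultdict(list)
    for seq in samples:
        buckets[len(seq) // bucket_width].append(seq)   # group by rounded length
    for bucket in buckets.values():
        for i in range(0, len(bucket), max_batch):
            chunk = bucket[i:i + max_batch]
            width = max(len(s) for s in chunk)
            yield [s + [0] * (width - len(s)) for s in chunk]   # pad within the batch

samples = [[1, 2], [3], [4, 5, 6, 7, 8], [9, 10, 11], [12, 13, 14, 15, 16, 17]]
for batch in runtime_batches(samples):
    print(batch)
```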
Citations: 0
RR-Compound: RDMA-Fused gRPC for Low Latency, High Throughput, and Easy Interface
IF 5.6 · CAS Region 2 (Computer Science) · Q1 Computer Science, Theory & Methods · Pub Date: 2024-03-23 · DOI: 10.1109/TPDS.2024.3404394
Liang Geng;Hao Wang;Jingsong Meng;Dayi Fan;Sami Ben-Romdhane;Hari Kadayam Pichumani;Vinay Phegade;Xiaodong Zhang
Advanced data centers strive for high performance and throughput, which can be achieved by combining the desirable merits of the Remote Procedure Call (RPC) programming model with the low latency of Remote Direct Memory Access (RDMA). However, despite the widespread availability of these software and hardware utilities, they have been used separately for their own applications in existing production systems for many years. Although researchers have attempted to develop RDMA-enabled RPC prototypes, these often face challenges such as API discrepancies and a lack of specific features for effective integration with major production software, rendering them incompatible. This industry R&D project aims to enhance the performance of gRPC, an RPC framework widely used at major companies, by integrating RDMA as an internal component. Our system solution, called RR-Compound, combines the simple user interface and other merits of gRPC with low latency for remote data accesses. RR-Compound is fully compatible with gRPC and can serve as a seamless replacement without altering existing applications. However, to achieve low latency, high throughput, and scalability for RR-Compound, several technical challenges in managing network connections and memory space utilization must be effectively addressed. To overcome the limitations of existing connection methods, we have developed a new method called BPEV that is independent of gRPC and applicable to all RDMA systems. We have also retained the asynchronous framework of gRPC, albeit with limited buffer space in RDMA memory management. In micro-benchmarks, RR-Compound outperforms mRPC, the state-of-the-art RPC framework for large numbers of connections, achieving a 14.77% increase in throughput and a 42.55% reduction in latency. Subsequently, we compare RR-Compound with gRPC over IPoIB using two real-world applications: KV-Store and TensorFlow. RR-Compound achieves up to a 2.35× increase in throughput and reduces the average latency by 46.92%.
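The drop-in-replacement property the abstract emphasizes boils down to a transport abstraction: application code programs against the same stub-style interface while the transport underneath can be swapped (e.g., TCP today, an RDMA-backed channel in RR-Compound). The Channel/Stub names and placeholder transports in this sketch are illustrative, not gRPC's or RR-Compound's real API.

```python
# Schematic "same interface, swappable transport" pattern; transports are placeholders.
from abc import ABC, abstractmethod

class Channel(ABC):
    @abstractmethod
    def call(self, method: str, request: bytes) -> bytes: ...

class TcpChannel(Channel):
    def call(self, method, request):
        return b"tcp:" + method.encode() + b":" + request      # placeholder transport

class RdmaChannel(Channel):
    def call(self, method, request):
        return b"rdma:" + method.encode() + b":" + request     # placeholder transport

class KvStoreStub:
    """Application-facing stub: identical regardless of the channel plugged in."""
    def __init__(self, channel: Channel):
        self._channel = channel
    def get(self, key: str) -> bytes:
        return self._channel.call("Get", key.encode())

for channel in (TcpChannel(), RdmaChannel()):
    print(KvStoreStub(channel).get("user:42"))   # caller code never changes
```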
Citations: 0