
IEEE Transactions on Parallel and Distributed Systems: Latest Publications

Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-26 · DOI: 10.1109/TPDS.2025.3637268
Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng
Large language models in deep learning have numerous parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware such as the Graphics Processing Unit (GPU), Tensor Cores can accelerate low-precision matrix multiplication, but achieving acceleration for sparse matrices is challenging: because of the sparsity, the utilization of Tensor Cores is relatively low. To address this, we propose the Tensor Core Compressed Sparse Row format (TC-CSR), which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we designed block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. Using these designs, we achieved a $\mathbf{1.41\times}$ speedup over Sputnik in scenarios of moderate sparsity and a $\mathbf{1.38\times}$ speedup with large-scale highly sparse matrices. Benefiting from our design, we achieved a $\mathbf{1.75\times}$ speedup in end-to-end inference with sparse Transformers while saving memory.
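The abstract does not spell out the TC-CSR layout, but the core idea it describes, storing nonzeros as dense tiles so that each tile-level product maps onto a Tensor Core matrix-multiply-accumulate, can be illustrated with a small NumPy sketch. The tile size, function names, and block-CSR field layout below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

TILE = 16  # illustrative tile size; real Tensor Core MMA tiles are e.g. 16x16x16

def to_block_csr(A, tile=TILE):
    """Compress a sparse matrix into CSR over dense tile blocks (TC-CSR-like sketch)."""
    m, k = A.shape
    assert m % tile == 0 and k % tile == 0
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(m // tile):
        for bj in range(k // tile):
            blk = A[bi*tile:(bi+1)*tile, bj*tile:(bj+1)*tile]
            if np.any(blk):                  # keep only nonzero tiles
                col_idx.append(bj)
                blocks.append(blk.copy())
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks

def block_spmm(row_ptr, col_idx, blocks, B, tile=TILE):
    """SpMM: multiply the blocked sparse matrix by a dense matrix B, one dense tile at a time."""
    m = (len(row_ptr) - 1) * tile
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for bi in range(len(row_ptr) - 1):
        for p in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[p]
            # each tile-times-panel product is a small dense MMA,
            # which is what a Tensor Core would execute in hardware
            C[bi*tile:(bi+1)*tile] += blocks[p] @ B[bj*tile:(bj+1)*tile]
    return C

A = np.zeros((64, 64)); A[:16, :16] = np.random.rand(16, 16)   # one dense tile
B = np.random.rand(64, 8)
rp, ci, blks = to_block_csr(A)
assert np.allclose(block_spmm(rp, ci, blks, B), A @ B)
```

On a GPU, each `blocks[p] @ B[...]` panel product would be issued as a Tensor Core MMA rather than a NumPy matmul; the CSR indexing over tiles is what keeps the data loading regular enough to feed those units.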
{"title":"Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks","authors":"Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng","doi":"10.1109/TPDS.2025.3637268","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637268","url":null,"abstract":"Large language models in deep learning have numerous parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware like Graphics Processing Unit (GPU), Tensor Core can accelerate low-precision matrix multiplication but achieve acceleration for sparse matrices is challenging. Due to its sparsity, the utilization of Tensor Cores is relatively low. To address this, we propose the based on <b>T</b>ensor <b>C</b>ore <b>C</b>ompressed <b>S</b>parse <b>R</b>ow format (TC-CSR), which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we designed block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. Utilizing these designs, we achieved a <inline-formula><tex-math>$mathbf {1.41times }$</tex-math></inline-formula> speedup on Sputnik in scenarios of moderate sparsity and a <inline-formula><tex-math>$mathbf {1.38times }$</tex-math></inline-formula> speedup with large-scale highly sparse matrices. Benefit from our design, we achieved a <inline-formula><tex-math>$mathbf {1.75times }$</tex-math></inline-formula> speedup in end-to-end inference with sparse Transformers and save memory.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"353-364"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-26 · DOI: 10.1109/TPDS.2025.3637171
Pablo Andreu;Pedro López;Carles Hernández
Multicore processors have emerged as the preferred architecture for safety-critical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces inter-core evictions that generate nondeterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a well-established countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and are therefore not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the proposed HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, delivering perfect measurements in 80% of cases and only a 1% error on maximum inter-core eviction measurements for a HashTAG tag size of ten bits.
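As a purely illustrative model of the bookkeeping an auxiliary tag directory performs, the sketch below tracks, for a shared set-associative cache, which core's insertion evicted a line last touched by a different core. The class name, LRU policy, and counters are hypothetical simplifications; they are not the HashTAG hardware design (which hashes tags to bound area) nor the CALM allocator.

```python
from collections import OrderedDict, defaultdict

class EvictionMonitor:
    """Toy model of a shared set-associative cache that records inter-core evictions."""
    def __init__(self, num_sets=64, ways=8):
        self.num_sets, self.ways = num_sets, ways
        # per set: ordered dict of tag -> owning core (LRU order)
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.inter_core_evictions = defaultdict(int)  # evicting core -> count

    def access(self, core, addr):
        s = self.sets[addr % self.num_sets]
        tag = addr // self.num_sets
        if tag in s:                        # hit: refresh LRU position
            s.move_to_end(tag)
            s[tag] = core
            return
        if len(s) >= self.ways:             # miss with full set: evict LRU line
            victim_tag, victim_core = s.popitem(last=False)
            if victim_core != core:         # charge the eviction to the inserting core
                self.inter_core_evictions[core] += 1
        s[tag] = core

mon = EvictionMonitor(num_sets=1, ways=2)
for a in (0, 1): mon.access(core=0, addr=a)   # core 0 fills the set
mon.access(core=1, addr=2)                    # core 1 evicts one of core 0's lines
print(dict(mon.inter_core_evictions))         # {1: 1}
```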
{"title":"HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring","authors":"Pablo Andreu;Pedro López;Carles Hernández","doi":"10.1109/TPDS.2025.3637171","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637171","url":null,"abstract":"Multicore processors have emerged as the preferred architecture for safetycritical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces intercore evictions that generate nondeterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a wellestablished countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and therefore are not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the proposed HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, presenting perfect measurements on 80% of cases and only a 1% error on maximum inter-core eviction measurements for a HashTAG tag size of ten bits.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"340-352"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11269742","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-25 · DOI: 10.1109/TPDS.2025.3637057
Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang
Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system’s health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of a compute failure by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture. There are four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data-rebalance I/Os for uniform distribution that are triggered even after the failure is recovered, and (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce NVMe-oF-R, a resilient disaggregated distributed storage architecture for fast recovery. NVMe-oF-R comprises three techniques: (1) the NVMe-oF adapter, which detects recoverable failure events and orchestrates relocation; (2) DCRUSH, a data placement strategy that considers the NVMe-oF based disaggregation architecture; and (3) the Relocater, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement NVMe-oF-R atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that NVMe-oF-R can eliminate unnecessary recovery traffic and reduce recovery time by more than 50%.
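The central observation in the abstract is that a compute-side failure in a disaggregated cluster can often be handled by reattaching the surviving NVMe-oF namespaces to another compute node rather than re-replicating data. The following sketch is a hypothetical control-plane decision function that captures that distinction; the `FailureEvent` fields and action names are invented for illustration and are not the NVMe-oF-R adapter's actual interface.

```python
from dataclasses import dataclass

@dataclass
class FailureEvent:
    node: str
    kind: str             # "compute" (host/daemon lost) or "storage" (NVMe device/media lost)
    spare_computes: list  # healthy nodes that can still reach the same NVMe-oF targets

def plan_recovery(ev: FailureEvent):
    """Hypothetical sketch: prefer reattachment over data rebuild for compute-only failures."""
    if ev.kind == "compute" and ev.spare_computes:
        # storage is intact and still reachable over the fabric:
        # relocate the service and reattach the namespaces, no rebalance I/O needed
        return {"action": "reattach", "target": ev.spare_computes[0]}
    # the storage itself is gone: fall back to replica / erasure-code driven rebuild
    return {"action": "rebuild_from_replicas", "failed": ev.node}

print(plan_recovery(FailureEvent("osd-3", "compute", ["host-b"])))
print(plan_recovery(FailureEvent("osd-7", "storage", ["host-b"])))
```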
{"title":"NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System","authors":"Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang","doi":"10.1109/TPDS.2025.3637057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637057","url":null,"abstract":"Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system’s health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of the compute failure by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture. There are four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data rebalance I/Os for uniform distribution triggered even after the failure is recovered, (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce <italic>NVMe-oF-R</i>, a resilient disaggregated distributed storage architecture for fast recovery. <italic>NVMe-oF-R</i> comprises three techniques: (1) <italic>NVMe-oF adapter</i>, which detects recoverable failure events and orchestrates relocation; (2) <italic>DCRUSH</i>, a data placement strategy that considers the NVMe-oF based disaggregation architecture; and (3) <italic>Relocater</i>, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement <italic>NVMe-oF-R</i> atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that <italic>NVMe-oF-R</i> can eliminate unnecessary recovery traffic and reduce recovery time by more than 50% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"380-394"},"PeriodicalIF":6.0,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-24 · DOI: 10.1109/TPDS.2025.3636547
Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li
Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a $2\times$ to $8\times$ improvement in convergence speed and a $3\times$ to $5\times$ improvement in per-iteration execution speed compared with state-of-the-art algorithms.
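As a worked illustration of the stochastic-gradient approach the abstract refers to, the sketch below performs plain SGD on one observed entry of a sparse third-order tensor under a standard Tucker model. It is a generic CPU/NumPy sketch, not the FastTuckerPlus update rule or the cuFastTuckerPlusTC Tensor Core kernel, and the rank, learning rate, and initialization scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 20, 20, 20, 4                          # tensor dims and Tucker rank (illustrative)
A, B, C = (0.3 * rng.standard_normal((d, R)) for d in (I, J, K))
G = 0.3 * rng.standard_normal((R, R, R))            # core tensor

def sgd_step(idx, val, lr=0.02):
    """One stochastic-gradient update on a single observed entry (i, j, k) -> val."""
    i, j, k = idx
    pred = np.einsum('pqr,p,q,r->', G, A[i], B[j], C[k])   # Tucker reconstruction of one entry
    err = val - pred
    gA = np.einsum('pqr,q,r->p', G, B[j], C[k])             # d pred / d A[i, :]
    gB = np.einsum('pqr,p,r->q', G, A[i], C[k])
    gC = np.einsum('pqr,p,q->r', G, A[i], B[j])
    gG = np.einsum('p,q,r->pqr', A[i], B[j], C[k])
    A[i] += lr * err * gA                                    # descend the squared error
    B[j] += lr * err * gB
    C[k] += lr * err * gC
    G[...] += lr * err * gG
    return err ** 2

# tiny demo on a handful of "nonzeros"; the squared training error typically shrinks
nnz = [((rng.integers(I), rng.integers(J), rng.integers(K)), rng.random()) for _ in range(50)]
print("before:", sum((v - np.einsum('pqr,p,q,r->', G, A[i], B[j], C[k])) ** 2
                     for (i, j, k), v in nnz))
for _ in range(500):
    loss = sum(sgd_step(idx, val) for idx, val in nnz)
print("after: ", loss)
```

Because every update touches only one row of each factor matrix plus the small core, many such updates on different nonzeros can proceed in parallel, which is the property GPU implementations exploit.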
{"title":"cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores","authors":"Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li","doi":"10.1109/TPDS.2025.3636547","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636547","url":null,"abstract":"Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a <inline-formula><tex-math>$2times$</tex-math></inline-formula> to <inline-formula><tex-math>$8times$</tex-math></inline-formula> improvement in convergence speed and a <inline-formula><tex-math>$3times$</tex-math></inline-formula> to <inline-formula><tex-math>$5times$</tex-math></inline-formula> improvement in per-iteration execution speed compared with state-of-the-art algorithms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"443-458"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
GLPilot: Efficient Distributed GNN Training With Learnable Embeddings
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-24 · DOI: 10.1109/TPDS.2025.3636057
Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li
Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communications during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets, and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups compared with two strong baselines, DGL and P3, while maintaining comparable model accuracy.
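A minimal sketch of the staleness-bounded buffering idea, reusing a cached copy of a remote embedding as long as its age stays within a bound, is shown below. The class name, API, and bound are assumptions for illustration rather than GLPilot's actual mechanism.

```python
class StalenessBoundedBuffer:
    """Toy sketch: reuse a cached remote embedding while its age stays within a staleness bound."""
    def __init__(self, fetch_fn, bound=8):
        self.fetch_fn = fetch_fn   # callable: vertex_id -> embedding (the expensive remote load)
        self.bound = bound         # max tolerated age, in global training steps
        self.cache = {}            # vertex_id -> (embedding, step_when_fetched)
        self.remote_fetches = 0

    def get(self, vid, step):
        if vid in self.cache:
            emb, fetched_at = self.cache[vid]
            if step - fetched_at <= self.bound:   # stale but within the bound: reuse it
                return emb
        emb = self.fetch_fn(vid)                  # otherwise pay for the remote load
        self.remote_fetches += 1
        self.cache[vid] = (emb, step)
        return emb

# usage with a fake "remote" embedding table
table = {v: [0.1 * v] * 4 for v in range(10)}
buf = StalenessBoundedBuffer(lambda v: table[v], bound=8)
for step in range(20):                            # vertex 3 is sampled on every step...
    buf.get(3, step)
print("remote fetches:", buf.remote_fetches)      # ...but loaded remotely only 3 times
```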
{"title":"GLPilot: Efficient Distributed GNN Training With Learnable Embeddings","authors":"Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li","doi":"10.1109/TPDS.2025.3636057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636057","url":null,"abstract":"Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communications during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups, in comparison with two strong baselines such as DGL and P3, while maintaining comparable model accuracy.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"489-503"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fully Decentralized Data Distribution for Large-Scale HPC Systems
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-17 · DOI: 10.1109/TPDS.2025.3633298
Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu
For many years in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have had to increase the number of data providers to improve I/O parallelism to match the data demanders. In large-scale, especially exascale, HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and offer the best scalability; we call this the all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on the computing networks of HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and relieve the tracker load with a neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for exascale HPC systems.
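The abstract names the HLF (Highest ranking and Longest consecutive piece segment First) piece-selection policy without defining it precisely. The sketch below shows one plausible reading, where the per-piece rank is a caller-supplied score and ties are broken by the length of the consecutive run of still-needed pieces a candidate would start; the function and its arguments are illustrative assumptions, not FD3's implementation.

```python
def select_next_piece(have, offered, rank):
    """Hedged sketch of an HLF-style choice: among pieces we still need and the peer offers,
    pick the highest-ranked piece, breaking ties by the length of the consecutive run of
    needed pieces it would start (longer runs first)."""
    def run_length(p):
        n = 0
        while p + n in offered and p + n not in have:
            n += 1
        return n
    candidates = [p for p in offered if p not in have]
    if not candidates:
        return None
    return max(candidates, key=lambda p: (rank[p], run_length(p)))

have = {0, 1}
offered = {2, 3, 4, 7, 8}
rank = {2: 5, 3: 5, 4: 5, 7: 5, 8: 9}
print(select_next_piece(have, offered, rank))   # 8: the highest-ranked piece wins first
rank[8] = 5
print(select_next_piece(have, offered, rank))   # 2: equal ranks, longest consecutive run (2,3,4)
```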
{"title":"Fully Decentralized Data Distribution for Large-Scale HPC Systems","authors":"Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu","doi":"10.1109/TPDS.2025.3633298","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3633298","url":null,"abstract":"For many years, in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have to increase the number of data providers to improve the IO parallelism to match the data demanders. In large-scale, especially exascale HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and have the best scalability, which is called all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on computing networks in HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and release the tracker load with neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, the performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for Exscale HPC systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"304-321"},"PeriodicalIF":6.0,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-12 · DOI: 10.1109/TPDS.2025.3632073
Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li
Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, among other domains. The Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multiple dataflows and high bandwidth memory (HBM), named DAHBM-GCN. Firstly, we designed a computing engine that supports multiple dataflows: aggregation-first and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on a decision tree is proposed to select the optimal dataflow computing engine. Secondly, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Thirdly, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the FPGA-based AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56×, respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.
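The adaptive selector in the paper is decision-tree based; the sketch below only illustrates the arithmetic that makes the choice matter, namely that aggregation-first ((A·X)·W) and combination-first (A·(X·W)) orders have different multiply-accumulate counts depending on the input and output feature widths. The cost model and the example layer shape are illustrative assumptions, not the paper's learned selector.

```python
def flops(order, num_nodes, nnz, f_in, f_out):
    """Multiply-accumulate count of one GCN layer for the two execution orders."""
    if order == "aggregation_first":      # (A @ X) @ W: aggregate f_in columns, then combine
        return nnz * f_in + num_nodes * f_in * f_out
    else:                                  # "combination_first": A @ (X @ W): combine, then aggregate f_out columns
        return num_nodes * f_in * f_out + nnz * f_out

def pick_order(num_nodes, nnz, f_in, f_out):
    costs = {o: flops(o, num_nodes, nnz, f_in, f_out)
             for o in ("aggregation_first", "combination_first")}
    return min(costs, key=costs.get), costs

# e.g. a Reddit-like layer shrinking 602 features to 64: combination-first aggregates far fewer columns
print(pick_order(num_nodes=233_000, nnz=115_000_000, f_in=602, f_out=64))
```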
{"title":"DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM","authors":"Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2025.3632073","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632073","url":null,"abstract":"Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, etc. Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multi-dataflow and high bandwidth memory (HBM), named DAHBM-GCN. Firstly, we designed a computing engine that supports multiple dataflows, aggregation-first, and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on the decision tree is proposed to select the optimal dataflow computing engine. Secondly, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Thirdly, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators FPGA-based, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56× respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"213-229"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-12 · DOI: 10.1109/TPDS.2025.3632089
Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers
Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for computation-communication-separated orchestration to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.
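As a schematic illustration of computation-communication-separated orchestration, the sketch below re-expresses one workflow stage as a compute part and a communication part whose memory size and parallelism are provisioned independently. The dataclass names and fields are hypothetical and do not reflect HyFaaS's actual HyDAG representation.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    kind: str            # "compute" or "communication"
    memory_mb: int       # function size chosen for this part alone
    parallelism: int     # number of function instances for this part

@dataclass
class HybridStage:
    name: str
    parts: list = field(default_factory=list)
    downstream: list = field(default_factory=list)   # names of successor stages (DAG edges)

# A monolithic "map" stage re-expressed as a hybrid stage: a heavy compute part followed by a
# light shuffle part whose instances can be sized and scaled independently of the compute part.
stage = HybridStage(
    name="map",
    parts=[Part("compute", memory_mb=3008, parallelism=64),
           Part("communication", memory_mb=512, parallelism=16)],
    downstream=["reduce"],
)
total_gb = sum(p.memory_mb / 1024 * p.parallelism for p in stage.parts)
print(f"{stage.name}: ~{total_gb:.0f} GB of concurrent memory provisioned across both parts")
```

The point of the split is that a coupled stage would have to provision the communication instances at the compute part's memory size (or vice versa), which is exactly the resource inefficiency the abstract argues against.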
{"title":"HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity","authors":"Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers","doi":"10.1109/TPDS.2025.3632089","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632089","url":null,"abstract":"Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for <italic>computation-communication-separated orchestration</i> to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"272-286"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-11 · DOI: 10.1109/TPDS.2025.3631654
Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen
The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.
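A minimal sketch of the dual-timescale structure described above: a fast loop that places each arriving task, and a slow loop that adjusts cooling only every K scheduling steps. The DQN and QMIX policies are replaced here by trivial placeholder rules, and all site parameters are invented for illustration.

```python
import random

K = 12                       # slow-timescale period: cooling is adjusted once every K scheduling steps

def schedule_task(task, state):
    """Fast timescale (placeholder for the DQN policy): send the task to the cheapest site."""
    return min(state["sites"], key=lambda s: state["price"][s] * task["load"])

def adjust_cooling(state):
    """Slow timescale (placeholder for the QMIX agents): nudge each site's airflow toward its setpoint."""
    for s in state["sites"]:
        error = state["temp"][s] - state["setpoint"]
        state["airflow"][s] = max(0.2, min(1.0, state["airflow"][s] + 0.05 * error))

state = {"sites": ["dc-a", "dc-b", "dc-c"],
         "price":    {"dc-a": 0.9, "dc-b": 1.2, "dc-c": 0.7},
         "temp":     {"dc-a": 24.0, "dc-b": 27.0, "dc-c": 25.5},
         "setpoint": 25.0,
         "airflow":  {"dc-a": 0.5, "dc-b": 0.5, "dc-c": 0.5}}

for t in range(60):                               # fast loop: one task per step
    task = {"load": random.uniform(0.1, 1.0)}
    site = schedule_task(task, state)
    state["temp"][site] += 0.02 * task["load"]    # toy thermal effect of placing work
    if t % K == 0:                                # slow loop: thermal management every K steps
        adjust_cooling(state)
print({s: round(state["airflow"][s], 2) for s in state["sites"]})
```

In the paper both loops are learned controllers; the sketch only shows how the two decision frequencies interleave.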
{"title":"D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers","authors":"Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen","doi":"10.1109/TPDS.2025.3631654","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631654","url":null,"abstract":"The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"230-246"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
How to Evaluate Distributed Coordination Systems?–A Survey and Analysis
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2025-11-11 · DOI: 10.1109/TPDS.2025.3631614
Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers and researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance, or create their own ad-hoc microbenchmarks and forgo comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
{"title":"How to Evaluate Distributed Coordination Systems?–A Survey and Analysis","authors":"Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas","doi":"10.1109/TPDS.2025.3631614","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631614","url":null,"abstract":"Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"198-212"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0