Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637268
Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng
Large language models in deep learning have enormous numbers of parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware such as the Graphics Processing Unit (GPU), Tensor Cores can accelerate low-precision matrix multiplication, but achieving acceleration for sparse matrices is challenging: due to the sparsity, Tensor Core utilization is relatively low. To address this, we propose the Tensor Core Compressed Sparse Row (TC-CSR) format, which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we design block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. With these designs, we achieve a $1.41\times$ speedup over Sputnik at moderate sparsity and a $1.38\times$ speedup on large-scale, highly sparse matrices. Benefiting from our design, we also achieve a $1.75\times$ speedup in end-to-end inference with sparse Transformers while saving memory.
{"title":"Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks","authors":"Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng","doi":"10.1109/TPDS.2025.3637268","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637268","url":null,"abstract":"Large language models in deep learning have numerous parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware like Graphics Processing Unit (GPU), Tensor Core can accelerate low-precision matrix multiplication but achieve acceleration for sparse matrices is challenging. Due to its sparsity, the utilization of Tensor Cores is relatively low. To address this, we propose the based on <b>T</b>ensor <b>C</b>ore <b>C</b>ompressed <b>S</b>parse <b>R</b>ow format (TC-CSR), which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we designed block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. Utilizing these designs, we achieved a <inline-formula><tex-math>$mathbf {1.41times }$</tex-math></inline-formula> speedup on Sputnik in scenarios of moderate sparsity and a <inline-formula><tex-math>$mathbf {1.38times }$</tex-math></inline-formula> speedup with large-scale highly sparse matrices. Benefit from our design, we achieved a <inline-formula><tex-math>$mathbf {1.75times }$</tex-math></inline-formula> speedup in end-to-end inference with sparse Transformers and save memory.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"353-364"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637171
Pablo Andreu;Pedro López;Carles Hernández
Multicore processors have emerged as the preferred architecture for safety-critical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces inter-core evictions that generate non-deterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a well-established countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and are therefore not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, delivering perfect measurements in 80% of cases and only a 1% error on the maximum inter-core eviction measurements for a HashTAG tag size of ten bits.
{"title":"HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring","authors":"Pablo Andreu;Pedro López;Carles Hernández","doi":"10.1109/TPDS.2025.3637171","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637171","url":null,"abstract":"Multicore processors have emerged as the preferred architecture for safetycritical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces intercore evictions that generate nondeterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a wellestablished countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and therefore are not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the proposed HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, presenting perfect measurements on 80% of cases and only a 1% error on maximum inter-core eviction measurements for a HashTAG tag size of ten bits.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"340-352"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11269742","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System
Pub Date: 2025-11-25 | DOI: 10.1109/TPDS.2025.3637057
Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang
Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system's health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of compute failures by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture, for four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data-rebalance I/Os for uniform distribution triggered even after the failure is recovered, and (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce NVMe-oF-R, a resilient disaggregated distributed storage architecture for fast recovery. NVMe-oF-R comprises three techniques: (1) the NVMe-oF adapter, which detects recoverable failure events and orchestrates relocation; (2) DCRUSH, a data placement strategy that considers the NVMe-oF-based disaggregation architecture; and (3) the Relocater, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement NVMe-oF-R atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that NVMe-oF-R can eliminate unnecessary recovery traffic and reduce recovery time by more than 50%.
{"title":"NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System","authors":"Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang","doi":"10.1109/TPDS.2025.3637057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637057","url":null,"abstract":"Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system’s health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of the compute failure by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture. There are four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data rebalance I/Os for uniform distribution triggered even after the failure is recovered, (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce <italic>NVMe-oF-R</i>, a resilient disaggregated distributed storage architecture for fast recovery. <italic>NVMe-oF-R</i> comprises three techniques: (1) <italic>NVMe-oF adapter</i>, which detects recoverable failure events and orchestrates relocation; (2) <italic>DCRUSH</i>, a data placement strategy that considers the NVMe-oF based disaggregation architecture; and (3) <italic>Relocater</i>, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement <italic>NVMe-oF-R</i> atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that <italic>NVMe-oF-R</i> can eliminate unnecessary recovery traffic and reduce recovery time by more than 50% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"380-394"},"PeriodicalIF":6.0,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
Pub Date: 2025-11-24 | DOI: 10.1109/TPDS.2025.3636547
Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li
Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a $2\times$ to $8\times$ improvement in convergence speed and a $3\times$ to $5\times$ improvement in per-iteration execution speed compared with state-of-the-art algorithms.
{"title":"cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores","authors":"Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li","doi":"10.1109/TPDS.2025.3636547","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636547","url":null,"abstract":"Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a <inline-formula><tex-math>$2times$</tex-math></inline-formula> to <inline-formula><tex-math>$8times$</tex-math></inline-formula> improvement in convergence speed and a <inline-formula><tex-math>$3times$</tex-math></inline-formula> to <inline-formula><tex-math>$5times$</tex-math></inline-formula> improvement in per-iteration execution speed compared with state-of-the-art algorithms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"443-458"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLPilot: Efficient Distributed GNN Training With Learnable Embeddings
Pub Date: 2025-11-24 | DOI: 10.1109/TPDS.2025.3636057
Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li
Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than the model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communication during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets, and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups compared with two strong baselines, DGL and P3, while maintaining comparable model accuracy.
{"title":"GLPilot: Efficient Distributed GNN Training With Learnable Embeddings","authors":"Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li","doi":"10.1109/TPDS.2025.3636057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636057","url":null,"abstract":"Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communications during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups, in comparison with two strong baselines such as DGL and P3, while maintaining comparable model accuracy.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"489-503"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully Decentralized Data Distribution for Large-Scale HPC Systems
Pub Date: 2025-11-17 | DOI: 10.1109/TPDS.2025.3633298
Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu
For many years in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have had to increase the number of data providers to improve I/O parallelism and match the data demanders. In large-scale, especially exascale, HPC systems, this mode of decoupling demanders and providers presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and offers the best scalability; we call this the all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on the computing networks of HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and relieve the tracker load with a neighborhood local-generation algorithm. Experimental results show that FD3 scales smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, this work can further stimulate the exploration of future distributed parallel file systems and provides a foundation and inspiration for the design of data access patterns for exascale HPC systems.
{"title":"Fully Decentralized Data Distribution for Large-Scale HPC Systems","authors":"Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu","doi":"10.1109/TPDS.2025.3633298","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3633298","url":null,"abstract":"For many years, in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have to increase the number of data providers to improve the IO parallelism to match the data demanders. In large-scale, especially exascale HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and have the best scalability, which is called all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on computing networks in HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and release the tracker load with neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, the performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for Exscale HPC systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"304-321"},"PeriodicalIF":6.0,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM
Pub Date: 2025-11-12 | DOI: 10.1109/TPDS.2025.3632073
Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li
Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, among others. The Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, make accelerating GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multiple dataflows and high bandwidth memory (HBM), named DAHBM-GCN. First, we design a computing engine that supports multiple dataflows: aggregation-first and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on a decision tree is proposed to select the optimal dataflow computing engine. Second, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Third, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model's computational complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that, across various datasets, DAHBM-GCN achieves average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the FPGA-based AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56×, respectively, on various datasets. Additionally, DAHBM-GCN offers high flexibility and low energy consumption.
{"title":"DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM","authors":"Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2025.3632073","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632073","url":null,"abstract":"Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, etc. Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multi-dataflow and high bandwidth memory (HBM), named DAHBM-GCN. Firstly, we designed a computing engine that supports multiple dataflows, aggregation-first, and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on the decision tree is proposed to select the optimal dataflow computing engine. Secondly, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Thirdly, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators FPGA-based, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56× respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"213-229"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity
Pub Date: 2025-11-12 | DOI: 10.1109/TPDS.2025.3632089
Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers
Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for computation-communication-separated orchestration to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.
{"title":"HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity","authors":"Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers","doi":"10.1109/TPDS.2025.3632089","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632089","url":null,"abstract":"Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for <italic>computation-communication-separated orchestration</i> to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"272-286"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers
Pub Date: 2025-11-11 | DOI: 10.1109/TPDS.2025.3631654
Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen
The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.
{"title":"D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers","authors":"Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen","doi":"10.1109/TPDS.2025.3631654","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631654","url":null,"abstract":"The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"230-246"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to Evaluate Distributed Coordination Systems?–A Survey and Analysis
Pub Date: 2025-11-11 | DOI: 10.1109/TPDS.2025.3631614
Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers and researchers either use a standard NoSQL benchmark and omit evaluating consistency, distribution, and fault tolerance, or create their own ad hoc microbenchmarks and forgo comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for evaluating the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
{"title":"How to Evaluate Distributed Coordination Systems?–A Survey and Analysis","authors":"Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas","doi":"10.1109/TPDS.2025.3631614","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631614","url":null,"abstract":"Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"198-212"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}