New Scheduling Algorithm and Analysis for Partitioned Periodic DAG Tasks on Multiprocessors
Pub Date: 2025-09-17 | DOI: 10.1109/TPDS.2025.3611446
Haochun Liang;Xu Jiang;Junyi Liu;Xiantong Luo;Songran Liu;Nan Guan;Wang Yi
Real-time systems are increasingly shifting from single processors to multiprocessors, where software must be parallelized to fully exploit the additional computational power. While the scheduling of real-time parallel tasks modeled as directed acyclic graphs (DAGs) has been extensively studied in the context of global scheduling, the scheduling and analysis of real-time DAG tasks under partitioned scheduling remain far less developed compared to the traditional scheduling of sequential tasks. Existing approaches primarily target plain fixed-priority partitioned scheduling and often rely on self-suspension–based analysis, which limits opportunities for further optimization. In particular, such methods fail to fully leverage fine-grained scheduling management that could improve schedulability. In this paper, we propose a novel approach for scheduling periodic DAG tasks, in which each DAG task is transformed into a set of real-time transactions by incorporating mechanisms for enforcing release offsets and intra-task priority assignments. We further develop corresponding analysis techniques and partitioning algorithms. Through comprehensive experiments, we evaluate the real-time performance of the proposed methods against state-of-the-art scheduling and analysis techniques. The results demonstrate that our approach consistently outperforms existing methods for scheduling periodic DAG tasks across a wide range of parameter settings.
{"title":"New Scheduling Algorithm and Analysis for Partitioned Periodic DAG Tasks on Multiprocessors","authors":"Haochun Liang;Xu Jiang;Junyi Liu;Xiantong Luo;Songran Liu;Nan Guan;Wang Yi","doi":"10.1109/TPDS.2025.3611446","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3611446","url":null,"abstract":"Real-time systems are increasingly shifting from single processors to multiprocessors, where software must be parallelized to fully exploit the additional computational power. While the scheduling of real-time parallel tasks modeled as directed acyclic graphs (DAGs) has been extensively studied in the context of global scheduling, the scheduling and analysis of real-time DAG tasks under partitioned scheduling remain far less developed compared to the traditional scheduling of sequential tasks. Existing approaches primarily target plain fixed-priority partitioned scheduling and often rely on self-suspension–based analysis, which limits opportunities for further optimization. In particular, such methods fail to fully leverage fine-grained scheduling management that could improve schedulability. In this paper, we propose a novel approach for scheduling periodic DAG tasks, in which each DAG task is transformed into a set of real-time transactions by incorporating mechanisms for enforcing release offsets and intra-task priority assignments. We further develop corresponding analysis techniques and partitioning algorithms. Through comprehensive experiments, we evaluate the real-time performance of the proposed methods against state-of-the-art scheduling and analysis techniques. The results demonstrate that our approach consistently outperforms existing methods for scheduling periodic DAG tasks across a wide range of parameter settings.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2621-2634"},"PeriodicalIF":6.0,"publicationDate":"2025-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HARMONIC: Uncertainty-Aware Multi-Objective Optimization for Energy-Efficient HPC Resource Management
Pub Date: 2025-09-16 | DOI: 10.1109/TPDS.2025.3610354
Kyrian C. Adimora;Hongyang Sun
Exascale high-performance computing (HPC) systems face critical resource management challenges, including massive energy consumption measured in megawatts per facility, performance variability for identical jobs, and inefficient resource utilization. Traditional single-objective schedulers cannot address these multifaceted challenges effectively. This paper introduces HARMONIC (Holistic Adaptive Resource Management Optimizing Next-generation Interconnected Computing), a novel framework that simultaneously optimizes performance, energy efficiency, and resilience through uncertainty-aware multi-objective optimization. Our approach distinguishes aleatoric uncertainty (inherent system variability) from epistemic uncertainty (modeling limitations) using Bayesian neural networks and employs graph-based representations to capture complex system dependencies. Experimental validation in both simulated environments and controlled testbeds demonstrates significant improvements over state-of-the-art schedulers: 10–19% energy reduction, 16–25% throughput improvement, and 18–32% reduction in performance variability. These results translate to potential annual savings of several million dollars per exascale facility, while enhancing scientific productivity through improved experimental reproducibility.
{"title":"HARMONIC: Uncertainty-Aware Multi-Objective Optimization for Energy-Efficient HPC Resource Management","authors":"Kyrian C. Adimora;Hongyang Sun","doi":"10.1109/TPDS.2025.3610354","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3610354","url":null,"abstract":"Exascale high-performance computing (HPC) systems face critical resource management challenges such as massive energy consumption in megawatts per facility, performance variability for identical jobs, and resource utilization inefficiencies. Traditional single-objective schedulers cannot address these multifaceted challenges effectively. This paper introduces <italic>HARMONIC</i> (Holistic Adaptive Resource Management Optimizing Next-generation Interconnected Computing), a novel framework that simultaneously optimizes performance, energy efficiency, and resilience through uncertainty-aware multi-objective optimization. Our approach distinguishes aleatoric uncertainty (inherent system variability) from epistemic uncertainty (modeling limitations) using Bayesian neural networks and employs graph-based representations to capture complex system dependencies. Experimental validation in both simulated environments and controlled testbeds demonstrates significant improvements over state-of-the-art schedulers: 10–19% energy reduction, 16–25% throughput improvement and 18–32% performance variability reduction. These results translate to potential annual savings of multimillion dollars per exascale facility while enhancing scientific productivity through improved experimental reproducibility.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2438-2450"},"PeriodicalIF":6.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FedBiF: Communication-Efficient Federated Learning via Bits Freezing
Pub Date: 2025-09-16 | DOI: 10.1109/TPDS.2025.3610224
Shiwei Li;Qunwei Li;Haozhao Wang;Ruixuan Li;Jianbin Lin;Wenliang Zhong
Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this letter, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 12, pp. 2668-2678.
MIST: Towards MPI Instant Startup and Termination on Tianhe HPC Systems
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3608434
Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu
As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may impact program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. By optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on the Tianhe-2A supercomputer demonstrate that MIST can reduce the MPI_Init() time by 32.5-77.6% and the MPI_Finalize() time by 28.9-85.0%.
{"title":"MIST: Towards MPI Instant Startup and Termination on Tianhe HPC Systems","authors":"Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu","doi":"10.1109/TPDS.2025.3608434","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3608434","url":null,"abstract":"As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may impact program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. Through optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on Tianhe-2 A supercomputer demonstrate that MIST can reduce the <italic>MPI_Init()</i> time by 32.5-77.6% and the <italic>MPI_Finalize()</i> time by 28.9-85.0%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2341-2353"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XDGNN: Efficient Distributed GNN Training via Explanation-Guided Subgraph Expansion
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3609152
Jie Gao;Jia Hu;Geyong Min;Fei Hao
Graph neural networks (GNNs) are a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory to train the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limiting training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplainable AI (XAI) methods into local partitions, mitigating the information loss caused by graph partitioning. XDGNN then conducts communication-free distributed training on these self-contained partitions, training the model in parallel without exchanging node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining model accuracy compared with current distributed GNN training methods.
{"title":"XDGNN: Efficient Distributed GNN Training via Explanation-Guided Subgraph Expansion","authors":"Jie Gao;Jia Hu;Geyong Min;Fei Hao","doi":"10.1109/TPDS.2025.3609152","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3609152","url":null,"abstract":"Graph neural network (GNN) is a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory to train the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limited the training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplanation AI (XAI) methods into local partitions, mitigating information loss caused by graph partitioning. Then, XDGNN conducts communication-free distributed training on these self-contained partitions through training the model in parallel without communicating node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining the model accuracy compared with current distributed GNN training methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2354-2365"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ToT: Triangle Counting on Tensor Cores
Pub Date: 2025-09-08 | DOI: 10.1109/TPDS.2025.3606878
YuAng Chen;Jeffrey Xu Yu
Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. This algorithm can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection, and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle sparse workloads for triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilize Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionalities for diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods. ToT outperforms the second-fastest method with a 3.81× speedup in end-to-end execution and achieves up to 17.00× memory savings. This work represents a pioneering exploration into utilizing Tensor Cores for accelerating the triangle counting algorithm.
{"title":"ToT: Triangle Counting on Tensor Cores","authors":"YuAng Chen;Jeffrey Xu Yu","doi":"10.1109/TPDS.2025.3606878","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606878","url":null,"abstract":"Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. This algorithm can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle sparse workloads for triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilizes Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionalities for diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods. ToT outperform the second-fastest method with an 3.81× speedup in end-to-end execution. Also, it achieves up to 17.00× memory savings. This work represents a pioneering exploration into utilizing Tensor Cores for accelerating the triangle counting algorithm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2679-2692"},"PeriodicalIF":6.0,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11153046","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Agent Collaboration for Workflow Task Offloading in End-Edge-Cloud Environments Using Deep Reinforcement Learning
Pub Date: 2025-09-04 | DOI: 10.1109/TPDS.2025.3606001
Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min
Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multiple MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. Existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision times, high computational overhead, and an inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments based on the Actor-Critic algorithm, called MCWT-AC. First, each MD is modeled as an agent that independently makes offloading decisions based on local information. Next, each MD’s workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan is gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that MCWT-AC exhibits superior adaptability and scalability. Moreover, MCWT-AC outperforms state-of-the-art methods and can quickly achieve optimal/near-optimal performance.
{"title":"Multi-Agent Collaboration for Workflow Task Offloading in End-Edge-Cloud Environments Using Deep Reinforcement Learning","authors":"Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min","doi":"10.1109/TPDS.2025.3606001","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606001","url":null,"abstract":"Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multi-MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. The existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision time, high computational overhead, and inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments with the Actor-Critic algorithm called MCWT-AC. First, each MD is modeled as an agent and independently makes offloading decisions based on local information. Next, each MD’s workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan can be gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that the MCWT-AC exhibits superior adaptability and scalability. Moreover, the MCWT-AC outperforms the state-of-art methods and can quickly achieve optimal/near-optimal performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2281-2296"},"PeriodicalIF":6.0,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605674
Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen
We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel $\frac{5}{4}(1+\epsilon)$-approximation algorithm, $\mathcal{BEKP}$, by relating it to a Multiple Subset Sum problem, improving upon the existing $(\frac{3}{2} - \frac{1}{2m})$-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new $(1-\frac{(m-1)^{m}}{m^{m}})$-approximation algorithm, $\mathcal{MDP}$, inspired by an interval selection algorithm with a $(1-\frac{m^{m}}{(m+1)^{m}})$-approximation for arbitrary rejection costs. Experimental results demonstrate that $\mathcal{BEKP}$ and $\mathcal{MDP}$ obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.
{"title":"Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times","authors":"Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen","doi":"10.1109/TPDS.2025.3605674","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605674","url":null,"abstract":"We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel <inline-formula><tex-math>$frac{5}{4}(1+epsilon )$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula>, by associating it to a Multiple Subset Sum problem, improving upon the existing <inline-formula><tex-math>$ (frac{3}{2} - frac{1}{2m})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new <inline-formula><tex-math>$(1-frac{(m-1)^{m}}{m^{m}})$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula>, inspired by an interval selection algorithm with a <inline-formula><tex-math>$(1-frac{m^{m}}{(m+1)^{m}})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Experimental results demonstrate that <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula> and <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula> obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2596-2608"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DynPipe: Toward Dynamic End-to-End Pipeline Parallelism for Interference-Aware DNN Training
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605491
Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin
Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from inefficiency and susceptibility to volatile hardware environments due to their suboptimal and static configurations. In this article, we propose DynPipe, an interference-aware asynchronous pipeline framework that optimizes end-to-end training performance in highly dynamic computing environments. By characterizing the non-overlapped communication overheads and the convergence rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes hardware speed with statistical convergence. Moreover, DynPipe deploys a non-intrusive random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on training efficiency. Following this evaluation guidance, DynPipe adaptively adjusts its partition plan to restore both intra- and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating time-to-accuracy by 1.5-3.4×.
{"title":"DynPipe: Toward Dynamic End-to-End Pipeline Parallelism for Interference-Aware DNN Training","authors":"Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin","doi":"10.1109/TPDS.2025.3605491","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605491","url":null,"abstract":"Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from <italic>inefficiency</i> and <italic>susceptibility</i> to <italic>volatile</i> hardware environment due to their suboptimal and <italic>static</i> configurations. In this article, we propose DynPipe, an <italic>interference-aware</i> asynchronous pipeline framework to optimize the <italic>end-to-end</i> training performance in highly <italic>dynamic</i> computing environments. By characterizing the <italic>non-overlapped</i> communication overheads and <italic>convergence</i> rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes the hardware speed with statistical convergence. Moreover, DynPipe deploys a <italic>non-intrusive</i> random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on the training efficiency. Following the evaluation guidance, DynPipe adaptively <italic>adjusts</i> partition plan to restore both intra and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating the time-to-accuracy by <italic>1.5-3.4×</i>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2366-2382"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11150566","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Wormhole Filters: High-Performance Approximate Membership Query Data Structures for Persistent Memory
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605780
Hancheng Wang;Haipeng Dai;Shusen Chen;Meng Li;Rong Gu;Youyou Lu;Chengxun Wu;Jiaqi Zheng;Lexi Xu;Guihai Chen
Approximate membership query (AMQ) data structures can approximately determine whether an element exists in a given dataset. They are widely used in parallel and distributed systems (e.g., high-performance databases, distributed cache systems, and bioinformatics systems) to avoid unnecessary dataset accesses, thereby accelerating massive data processing. For AMQ data structures used in the above systems, achieving high throughput, low false positive rate, and large capacity objectives simultaneously is critical but challenging. Porting AMQ data structures from DRAM to persistent memory makes it possible to achieve the above three objectives simultaneously, but this porting is not a trivial task. Specifically, existing AMQ data structures generate numerous random accesses and/or sequential writes on persistent memory, resulting in poor throughput. Therefore, in the conference version of this paper, we proposed a novel AMQ data structure called wormhole filter, which achieves high throughput on persistent memory, thereby achieving the above three objectives simultaneously. In this journal version, we extend our prior work by introducing parallel wormhole filters to enhance parallel performance. Additionally, we integrate parallel wormhole filters into the LevelDB database system to show that porting AMQ data structures to persistent memory significantly improves system end-to-end throughput. Theoretical analysis and experimental results show that wormhole filters significantly outperform state-of-the-art AMQ data structures. For example, wormhole filters achieve 12.06× insertion throughput, 1.98× positive lookup throughput, and 8.82× deletion throughput of the best competing baseline.
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 11, pp. 2229-2246.