Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this letter, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.
{"title":"FedBiF: Communication-Efficient Federated Learning via Bits Freezing","authors":"Shiwei Li;Qunwei Li;Haozhao Wang;Ruixuan Li;Jianbin Lin;Wenliang Zhong","doi":"10.1109/TPDS.2025.3610224","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3610224","url":null,"abstract":"Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this letter, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2668-2678"},"PeriodicalIF":6.0,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3608434
Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu
As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may affect program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. By optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on the Tianhe-2A supercomputer demonstrate that MIST can reduce the MPI_Init() time by 32.5-77.6% and the MPI_Finalize() time by 28.9-85.0%.
{"title":"MIST: Towards MPI Instant Startup and Termination on Tianhe HPC Systems","authors":"Yiqin Dai;Ruibo Wang;Yong Dong;Min Xie;Juan Chen;Wenzhe Zhang;Huijun Wu;Mingtian Shao;Kai Lu","doi":"10.1109/TPDS.2025.3608434","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3608434","url":null,"abstract":"As the size of MPI programs grows with expanding HPC resources and parallelism demands, the overhead of MPI startup and termination escalates due to the inclusion of less scalable global operations. Global operations involving extensive cross-machine communication and synchronization are crucial for ensuring semantic correctness. The current focus is on optimizing and accelerating these global operations rather than removing them, as the latter involves systematic changes to the system software stack and may impact program semantics. Given this background, we propose a systematic solution named MIST to safely eliminate global operations in MPI startup and termination. Through optimizing the generation of communication addresses, designing reliable communication protocols, and exploiting the resource release mechanism, MIST eliminates all global operations to achieve MPI instant startup and termination while ensuring correct program execution. Experiments on Tianhe-2 A supercomputer demonstrate that MIST can reduce the <italic>MPI_Init()</i> time by 32.5-77.6% and the <italic>MPI_Finalize()</i> time by 28.9-85.0%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2341-2353"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-11 | DOI: 10.1109/TPDS.2025.3609152
Jie Gao;Jia Hu;Geyong Min;Fei Hao
Graph neural networks (GNNs) are a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory and trains the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limiting training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplainable AI (XAI) methods into local partitions, mitigating the information loss caused by graph partitioning. XDGNN then conducts communication-free distributed training on these self-contained partitions, training the model in parallel without communicating node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining model accuracy compared with current distributed GNN training methods.
{"title":"XDGNN: Efficient Distributed GNN Training via Explanation-Guided Subgraph Expansion","authors":"Jie Gao;Jia Hu;Geyong Min;Fei Hao","doi":"10.1109/TPDS.2025.3609152","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3609152","url":null,"abstract":"Graph neural network (GNN) is a state-of-the-art technique for learning structural information from graph data. However, training GNNs on large-scale graphs is very challenging due to the size of real-world graphs and the message-passing architecture of GNNs. One promising approach for scaling GNNs is distributed training across multiple accelerators, where each accelerator holds a partitioned subgraph that fits in memory to train the model in parallel. Existing distributed GNN training methods require frequent and prohibitive embedding exchanges between partitions, leading to substantial communication overhead and limited the training efficiency. To address this challenge, we propose XDGNN, a novel distributed GNN training method that eliminates the forward communication bottleneck and thus accelerates training. Specifically, we design an explanation-guided subgraph expansion technique that incorporates important structures identified by eXplanation AI (XAI) methods into local partitions, mitigating information loss caused by graph partitioning. Then, XDGNN conducts communication-free distributed training on these self-contained partitions through training the model in parallel without communicating node embeddings in the forward phase. Extensive experiments demonstrate that XDGNN significantly improves training efficiency while maintaining the model accuracy compared with current distributed GNN training methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2354-2365"},"PeriodicalIF":6.0,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-08 | DOI: 10.1109/TPDS.2025.3606878
YuAng Chen;Jeffrey Xu Yu
Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. It can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection, and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle the sparse workloads of triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilize Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionality to diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods; it outperforms the second-fastest method with a 3.81× speedup in end-to-end execution and achieves up to 17.00× memory savings. This work represents a pioneering exploration of Tensor Cores for accelerating the triangle counting algorithm.
{"title":"ToT: Triangle Counting on Tensor Cores","authors":"YuAng Chen;Jeffrey Xu Yu","doi":"10.1109/TPDS.2025.3606878","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606878","url":null,"abstract":"Triangle counting is a fundamental graph algorithm used to identify the number of triangles within a graph. This algorithm can be reformulated into linear algebraic operations, including sparse matrix multiplication, intersection and reduction. Modern GPUs, equipped with Tensor Cores, offer massive parallelism that can significantly accelerate graph algorithms. However, leveraging Tensor Cores, originally designed for dense matrix multiplication, to handle sparse workloads for triangle counting presents non-trivial challenges. In this paper, we conduct an in-depth analysis of the state-of-the-art techniques that utilizes Tensor Cores for matrix operations, identifying critical performance shortfalls. Based on these insights, we introduce ToT, which enhances the utilization of Tensor Cores and expands their functionalities for diverse sparse matrix operations. In experiments, ToT is evaluated against state-of-the-art methods. ToT outperform the second-fastest method with an 3.81× speedup in end-to-end execution. Also, it achieves up to 17.00× memory savings. This work represents a pioneering exploration into utilizing Tensor Cores for accelerating the triangle counting algorithm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2679-2692"},"PeriodicalIF":6.0,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11153046","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-04 | DOI: 10.1109/TPDS.2025.3606001
Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min
Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multiple MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. Existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision time, high computational overhead, and an inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose MCWT-AC, a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments based on the Actor-Critic algorithm. First, each MD is modeled as an agent that independently makes offloading decisions based on local information. Next, each MD's workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan is gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that MCWT-AC exhibits superior adaptability and scalability. Moreover, MCWT-AC outperforms state-of-the-art methods and quickly achieves optimal or near-optimal performance.
{"title":"Multi-Agent Collaboration for Workflow Task Offloading in End-Edge-Cloud Environments Using Deep Reinforcement Learning","authors":"Bohuai Xiao;Chujia Yu;Xing Chen;Zheyi Chen;Geyong Min","doi":"10.1109/TPDS.2025.3606001","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3606001","url":null,"abstract":"Computation offloading utilizes powerful cloud and edge resources to process workflow applications offloaded from Mobile Devices (MDs), effectively alleviating the resource constraints of MDs. In end-edge-cloud environments, workflow applications typically exhibit complex task dependencies. Meanwhile, parallel tasks from multi-MDs result in an expansive solution space for offloading decisions. Therefore, determining optimal offloading plans for highly dynamic and complex end-edge-cloud environments presents significant challenges. The existing studies on offloading tasks for multi-MD workflows often adopt centralized decision-making methods, which suffer from prolonged decision time, high computational overhead, and inability to identify suitable offloading plans in large-scale scenarios. To address these challenges, we propose a Multi-agent Collaborative method for Workflow Task offloading in end-edge-cloud environments with the Actor-Critic algorithm called MCWT-AC. First, each MD is modeled as an agent and independently makes offloading decisions based on local information. Next, each MD’s workflow task offloading decision model is obtained through the Actor-Critic algorithm. At runtime, an effective workflow task offloading plan can be gradually developed through multi-agent collaboration. Extensive simulation results demonstrate that the MCWT-AC exhibits superior adaptability and scalability. Moreover, the MCWT-AC outperforms the state-of-art methods and can quickly achieve optimal/near-optimal performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2281-2296"},"PeriodicalIF":6.0,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605674
Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen
We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel $\frac{5}{4}(1+\epsilon)$-approximation algorithm, $\mathcal{BEKP}$, by associating it to a Multiple Subset Sum problem, improving upon the existing $(\frac{3}{2} - \frac{1}{2m})$-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new $(1-\frac{(m-1)^{m}}{m^{m}})$-approximation algorithm, $\mathcal{MDP}$, inspired by an interval selection algorithm with a $(1-\frac{m^{m}}{(m+1)^{m}})$-approximation for arbitrary rejection costs. Experimental results demonstrate that $\mathcal{BEKP}$ and $\mathcal{MDP}$ obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.
{"title":"Approximation Algorithms for Scheduling With/Without Deadline Constraints Where Rejection Costs are Proportional to Processing Times","authors":"Olivier Beaumont;Rémi Bouzel;Lionel Eyraud-Dubois;Esragul Korkmaz;Laércio Lima Pilla;Alexandre Van Kempen","doi":"10.1109/TPDS.2025.3605674","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605674","url":null,"abstract":"We study two offline job scheduling problems where tasks can be processed on a limited number of energy-efficient edge machines or offloaded to an unlimited supply of energy-inefficient cloud machines (called rejected). The objective is to minimize total energy consumption. First, we consider scheduling without deadlines, formulating it as a scheduling problem with rejection, where rejection costs are proportional to processing times. We propose a novel <inline-formula><tex-math>$frac{5}{4}(1+epsilon )$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula>, by associating it to a Multiple Subset Sum problem, improving upon the existing <inline-formula><tex-math>$ (frac{3}{2} - frac{1}{2m})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Next, we address scheduling with deadlines, aiming to minimize the weighted number of rejected jobs. We position this problem within the literature and introduce a new <inline-formula><tex-math>$(1-frac{(m-1)^{m}}{m^{m}})$</tex-math></inline-formula>-approximation algorithm, <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula>, inspired by an interval selection algorithm with a <inline-formula><tex-math>$(1-frac{m^{m}}{(m+1)^{m}})$</tex-math></inline-formula>-approximation for arbitrary rejection costs. Experimental results demonstrate that <inline-formula><tex-math>$mathcal {BEKP}$</tex-math></inline-formula> and <inline-formula><tex-math>$mathcal {MDP}$</tex-math></inline-formula> obtain better results (lower costs or higher profits) than other state-of-the-art algorithms while maintaining a competitive or better time complexity.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2596-2608"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-03 | DOI: 10.1109/TPDS.2025.3605491
Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin
Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from inefficiency and susceptibility to volatile hardware environments due to their suboptimal and static configurations. In this article, we propose DynPipe, an interference-aware asynchronous pipeline framework to optimize end-to-end training performance in highly dynamic computing environments. By characterizing the non-overlapped communication overheads and the convergence rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes hardware speed with statistical convergence. Moreover, DynPipe deploys a non-intrusive random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on training efficiency. Following this evaluation guidance, DynPipe adaptively adjusts the partition plan to restore both intra- and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating time-to-accuracy by 1.5-3.4×.
{"title":"DynPipe: Toward Dynamic End-to-End Pipeline Parallelism for Interference-Aware DNN Training","authors":"Zhengyi Yuan;Xiong Wang;Yuntao Nie;Yufei Tao;Yuqing Li;Zhiyuan Shao;Xiaofei Liao;Bo Li;Hai Jin","doi":"10.1109/TPDS.2025.3605491","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605491","url":null,"abstract":"Pipeline parallelism has emerged as an indispensable technique for training large deep neural networks. While existing asynchronous pipeline systems address the time bubbles inherent in synchronous architectures, they continue to suffer from <italic>inefficiency</i> and <italic>susceptibility</i> to <italic>volatile</i> hardware environment due to their suboptimal and <italic>static</i> configurations. In this article, we propose DynPipe, an <italic>interference-aware</i> asynchronous pipeline framework to optimize the <italic>end-to-end</i> training performance in highly <italic>dynamic</i> computing environments. By characterizing the <italic>non-overlapped</i> communication overheads and <italic>convergence</i> rate conditioned on stage-wise staleness, DynPipe carefully crafts an optimized pipeline partition that harmonizes the hardware speed with statistical convergence. Moreover, DynPipe deploys a <italic>non-intrusive</i> random forest model that utilizes runtime stage statistics to evaluate the impact of environmental changes, such as task interference and network jitter, on the training efficiency. Following the evaluation guidance, DynPipe adaptively <italic>adjusts</i> partition plan to restore both intra and inter-stage load balancing, thereby facilitating seamless pipeline reconfiguration in dynamic environments. Extensive experiments show that DynPipe outperforms state-of-the-art systems, accelerating the time-to-accuracy by <italic>1.5-3.4×</i>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2366-2382"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11150566","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate membership query (AMQ) data structures can approximately determine whether an element exists in a given dataset. They are widely used in parallel and distributed systems (e.g., high-performance databases, distributed cache systems, and bioinformatics systems) to avoid unnecessary dataset accesses, thereby accelerating massive data processing. For AMQ data structures used in these systems, simultaneously achieving high throughput, a low false positive rate, and large capacity is critical but challenging. Porting AMQ data structures from DRAM to persistent memory makes it possible to achieve all three objectives at once, but this porting is not a trivial task. Specifically, existing AMQ data structures generate numerous random accesses and/or sequential writes on persistent memory, resulting in poor throughput. Therefore, in the conference version of this paper, we proposed a novel AMQ data structure called the wormhole filter, which achieves high throughput on persistent memory and thereby attains the three objectives simultaneously. In this journal version, we extend our prior work by introducing parallel wormhole filters to enhance parallel performance. Additionally, we integrate parallel wormhole filters into the LevelDB database system to show that porting AMQ data structures to persistent memory significantly improves end-to-end system throughput. Theoretical analysis and experimental results show that wormhole filters significantly outperform state-of-the-art AMQ data structures. For example, wormhole filters achieve 12.06× the insertion throughput, 1.98× the positive lookup throughput, and 8.82× the deletion throughput of the best competing baseline.
{"title":"Parallel Wormhole Filters: High-Performance Approximate Membership Query Data Structures for Persistent Memory","authors":"Hancheng Wang;Haipeng Dai;Shusen Chen;Meng Li;Rong Gu;Youyou Lu;Chengxun Wu;Jiaqi Zheng;Lexi Xu;Guihai Chen","doi":"10.1109/TPDS.2025.3605780","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605780","url":null,"abstract":"Approximate membership query (AMQ) data structures can approximately determine whether an element exists in a given dataset. They are widely used in parallel and distributed systems (e.g., high-performance databases, distributed cache systems, and bioinformatics systems) to avoid unnecessary dataset accesses, thereby accelerating massive data processing. For AMQ data structures used in the above systems, achieving high throughput, low false positive rate, and large capacity objectives simultaneously is critical but challenging. Porting AMQ data structures from DRAM to persistent memory makes it possible to achieve the above three objectives simultaneously, but this porting is not a trivial task. Specifically, existing AMQ data structures generate numerous random accesses and/or sequential writes on persistent memory, resulting in poor throughput. Therefore, in the conference version of this paper, we proposed a novel AMQ data structure called wormhole filter, which achieves high throughput on persistent memory, thereby achieving the above three objectives simultaneously. In this journal version, we extend our prior work by introducing parallel wormhole filters to enhance parallel performance. Additionally, we integrate parallel wormhole filters into the LevelDB database system to show that porting AMQ data structures to persistent memory significantly improves system end-to-end throughput. Theoretical analysis and experimental results show that wormhole filters significantly outperform state-of-the-art AMQ data structures. For example, wormhole filters achieve 12.06× insertion throughput, 1.98× positive lookup throughput, and 8.82× deletion throughput of the best competing baseline.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2229-2246"},"PeriodicalIF":6.0,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-09-02 | DOI: 10.1109/TPDS.2025.3605272
Huijun Wang;Oliver Sinnen
Task scheduling for parallel computing is strongly NP-hard even without precedence constraints ($P||C_{\max}$). With any kind of precedence constraints and communication delays, the problem becomes less manageable still. We look at the specific case of scheduling under the precedence constraints of a fork-join structure, including communication delays: $P[Q]|\text{fork-join}, c_{ij}|C_{\max}$. This represents any kind of computation that divides into sub-computations whose end results are processed together. Looking at special cases where computation costs are equal, we propose polynomial-time approximation and exact algorithms for them, considering homogeneous and (related) heterogeneous processors. Having those algorithms allows us to study the quality of heuristics in a large experimental evaluation, which demonstrates that heuristic schedulers perform well enough in most cases.
{"title":"Scheduling Fork-Joins With Communication Delays and Equal Processing Times on Heterogeneous Processors","authors":"Huijun Wang;Oliver Sinnen","doi":"10.1109/TPDS.2025.3605272","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3605272","url":null,"abstract":"Task scheduling for parallel computing is strongly NP-hard even without precedence constraints <inline-formula><tex-math>$P||C_{max}$</tex-math></inline-formula>. With any kind of precedence constraints and communication delays the problem becomes less manageable still. We look at the specific case of scheduling under the precedence constraints of a fork-join structure (including communication delays) <inline-formula><tex-math>$P[Q]|fork-join, c_{ij}|C_{max}$</tex-math></inline-formula>. This represents any kind of computation that divides into sub-computations with the end results being processed together. Looking at special cases where computation costs are equal, we propose polynomial time approximations and exact algorithms for them, considering homogenous and (related) heterogenous processors. Having those algorithms allows us to study the quality of heuristics in a large experimental evaluation. This demonstrates that heuristic schedulers perform well enough in most cases.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2297-2309"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this article, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23× in CKKS HMUL and 1.15× in bootstrapping. It also achieves up to 4.87× and 1.51× speedups for TFHE bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3× average speedup for scheme switching over CPU-based implementation.
{"title":"Chameleon: An Efficient FHE Scheme Switching Acceleration on GPUs","authors":"Zhiwei Wang;Haoqi He;Lutan Zhao;Peinan Li;Zhihao Li;Dan Meng;Rui Hou","doi":"10.1109/TPDS.2025.3604866","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3604866","url":null,"abstract":"Fully homomorphic encryption (FHE) enables direct computation on encrypted data, making it a crucial technology for privacy protection. However, FHE suffers from significant performance bottlenecks. In this context, GPU acceleration offers a promising solution to bridge the performance gap. Existing efforts primarily focus on single-class FHE schemes, which fail to meet the diverse requirements of data types and functions, prompting the development of hybrid multi-class FHE schemes. However, studies have yet to thoroughly investigate specific GPU optimizations for hybrid FHE schemes. In this article, we present an efficient GPU-based FHE scheme switching acceleration named Chameleon. First, we propose a scalable NTT acceleration design that adapts to larger CKKS polynomials and smaller TFHE polynomials. Specifically, Chameleon tackles synchronization issues by fusing stages to reduce synchronization, employing polynomial coefficient shuffling to minimize synchronization scale, and utilizing an SM-aware combination strategy to identify the optimal switching point. Second, Chameleon is the first to comprehensively analyze and optimize critical switching operations. It introduces CMux-level parallelization to accelerate LUT evaluation and a homomorphic rotation-free matrix-vector multiplication to improve repacking efficiency. Finally, Chameleon outperforms the state-of-the-art GPU implementations by 1.23× in CKKS HMUL and 1.15× in bootstrapping. It also achieves up to 4.87× and 1.51× speedups for TFHE bootstrapping compared to CPU and GPU versions, respectively, and delivers a 67.3× average speedup for scheme switching over CPU-based implementation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2264-2280"},"PeriodicalIF":6.0,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11146703","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}