Pub Date: 2026-01-29 | DOI: 10.1109/TPDS.2026.3659324
Stijn Heldens;Ben van Werkhoven
Reduced-precision floating-point arithmetic has become increasingly important in GPU applications for AI and HPC, as it can deliver substantial speedups while reducing energy consumption and memory footprint. However, choosing the appropriate data formats brings a challenging tuning problem: precision parameters must be chosen to maximize performance while preserving numerical accuracy. At the same time, GPU kernels typically expose additional tunable optimization parameters, such as block size, tiling strategy, and vector width. The combination of these two kinds of parameters results in a complex trade-off between accuracy and performance, making manual exploration of the resulting design space time-consuming. In this work, we present an accuracy-aware extension to the open-source Kernel Tuner framework, enabling automatic tuning of floating-point precision parameters alongside conventional code-optimization parameters. We evaluate our accuracy-aware tuning solution on both Nvidia and AMD GPUs using a variety of kernels. Our results show speedups of up to $12\times$ over double precision, demonstrate how Kernel Tuner’s built-in search strategies are effective for accuracy-aware tuning, and show that our approach can be extended to other optimization objectives, such as memory footprint or energy efficiency. Moreover, we highlight that jointly tuning accuracy- and performance-affecting parameters outperforms isolated approaches in finding the best-performing configurations, despite significantly expanding the optimization space. This unified approach enables developers to trade accuracy for throughput systematically, enabling broader adoption of mixed-precision computing in scientific and industrial applications.
"Accuracy-Aware Mixed-Precision GPU Auto-Tuning," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 867–884.
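To make the precision-tuning problem concrete, here is a minimal Python sketch of the accuracy constraint such a tuner must respect: among candidate floating-point formats, choose the cheapest one whose result stays within a user-specified error bound. This is a toy illustration, not Kernel Tuner's actual API; the function names and the linear scan over formats are assumptions of the sketch.

```python
import numpy as np

def relative_error(approx, exact):
    """L2 relative error of a reduced-precision result vs. a float64 reference."""
    return np.linalg.norm(np.float64(approx) - exact) / np.linalg.norm(exact)

def pick_precision(x, y, tolerance):
    """Return the narrowest dtype whose dot-product error is within `tolerance`.

    Candidates are ordered cheapest-first, mirroring the idea that an
    accuracy-aware tuner prefers reduced precision whenever accuracy permits.
    """
    exact = np.dot(np.float64(x), np.float64(y))
    for dtype in (np.float16, np.float32, np.float64):
        approx = np.dot(x.astype(dtype), y.astype(dtype))
        if relative_error(approx, exact) <= tolerance:
            return np.dtype(dtype).name
    return "float64"

rng = np.random.default_rng(0)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
print(pick_precision(x, y, tolerance=float("inf")))  # any error allowed -> float16
print(pick_precision(x, y, tolerance=0.0))           # bitwise-exact only -> float64
```

In the paper's setting, the format choice would be one tunable parameter searched jointly with block size and tiling by Kernel Tuner's search strategies, rather than scanned linearly as here.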
The efficient execution of data center jobs that require simultaneous use of different resource types is of critical importance. When processing capacity is the crucial resource for job execution, the term multiserver jobs is used, where server indicates processors or CPU cores providing processing capacity. Each multiserver job specifies the number of servers it requires and its service duration. Achieving efficient execution of multiserver jobs relies heavily on effective scheduling of jobs on the existing servers. Several schedulers have been proposed, aimed at improving resource utilization at the cost of increased complexity. Due to the limited availability of theoretical results on scheduler behavior in the case of multiserver jobs, data center schedulers are often designed based only on managers’ experience. In this article, aiming to expand the understanding of multiserver job schedulers’ performance, we study Small Shuffle (SMASH) schedulers, a class of nonpreemptive, service-time-oblivious, window-based multiserver job scheduling algorithms that strike a balance between simplicity and efficient resource utilization, while allowing performance evaluation in simpler settings. SMASH implies only a marginal increase in complexity compared to FIFO, yet it delivers substantial performance improvements for multiserver jobs. Depending on the system parameters, SMASH can nearly double the system’s stability region with respect to FIFO, leading to significantly lower response times across a broad range of loads. Moreover, the magnitude of this improvement scales with the chosen window size, allowing performance to be tuned to the system’s operating conditions. We first study the capacity of SMASH with analytical tools in simple settings, then we investigate the performance of SMASH and other schedulers with simulations under more realistic workloads, designed with parameters derived from measurements of real data centers. Results show that SMASH offers a very good compromise between performance and complexity.
Diletta Olliaro, Sabina Rossi, Adityo Anggraito, Andrea Marin, Marco Ajmone Marsan, "On the Performance of SMASH: A Non-Preemptive Window-Based Scheduler for Multiserver Jobs," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 966–981. Pub Date: 2026-01-28 | DOI: 10.1109/TPDS.2026.3657959
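The window-based idea can be sketched in a few lines of Python. This is a hedged illustration of the general mechanism, not the SMASH algorithm from the paper: with window 1 the scheduler degenerates to FIFO and a large head-of-line job leaves servers idle, while a wider window lets smaller jobs behind it fill the gap.

```python
def schedule_window(queue, free_servers, window):
    """Non-preemptively pick jobs to start from the first `window` positions.

    `queue` holds each waiting job's server demand, head first. Returns the
    queue indices of the jobs that start now.
    """
    started = []
    for i, demand in enumerate(queue[:window]):
        if demand <= free_servers:
            started.append(i)
            free_servers -= demand
    return started

queue = [8, 2, 3, 1]  # server demands; the head job needs 8 servers
print(schedule_window(queue, free_servers=5, window=1))  # [] -- FIFO: head blocks everything
print(schedule_window(queue, free_servers=5, window=4))  # [1, 2] -- smaller jobs fill the gap
```

The trade-off the paper analyzes is visible even here: a larger window improves server utilization but can delay large jobs, which is why the window size is a tuning knob tied to the operating conditions.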
Leveraging the latest Sunway supercomputer, we developed a fully optimized earthquake simulation model that accurately captures topographic effects for realistic seismic analysis. Optimizing for the SW26010Pro architecture with DMA/RMA communication mechanisms, data compression schemes, and vectorization, we achieved a speedup exceeding 160×. Our pipeline-based computation and communication overlapping scheme, combined with performance prediction models, further minimized computational costs. These optimizations enabled the largest-scale curvilinear grid finite-difference method (CGFDM) earthquake simulations to date, covering 197 trillion grid points and achieving 86.7 PFLOPS on 39 million cores with a weak scaling efficiency of 97.9%. These advancements enabled the successful simulation of the 2008 Wenchuan earthquake, providing high-resolution seismic insights and robust assessments for regional hazard mitigation and disaster preparedness.
Lin Gan, Wubing Wan, Zekun Yin, Wenqiang Wang, Yilong Li, Zhenguo Zhang, Zhong He, Ping Gao, Xiaohui Duan, Weiguo Liu, Wei Xue, Haohuan Fu, Guangwen Yang, Xiaofei Chen, "Exploiting the Performance Potential of Extreme-Scale Earthquake Simulation: Achieving 86.7 PFLOPS With Over 39 Million Cores," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 997–1014. Pub Date: 2026-01-28 | DOI: 10.1109/TPDS.2026.3658568
Pub Date: 2026-01-27 | DOI: 10.1109/TPDS.2026.3658162
Thomas Faingnaert;Ward Vermeulen;Tim Besard;Bjorn De Sutter
Tensor contractions extend the concept of the General Matrix Multiplication (GEMM) to high-dimensional spaces. They enable sophisticated computations in various scientific disciplines. Graphics Processing Units (GPUs) are commonly used to accelerate tensor contraction algorithms due to their inherent parallelisability. NVIDIA’s cuTENSOR stands as a state-of-the-art library for GPU-based tensor contractions. However, its lack of flexibility limits researchers in tailoring contraction kernels to their specific research needs. This paper presents a novel and flexible implementation of the GEMM-like Tensor Tensor (GETT) multiplication algorithm for tensor contractions in Julia. By repurposing and adapting components of GemmKernels.jl, a versatile library offering customisable and high-performance GEMM kernels for CUDA-enabled GPUs, we construct GEMM-like kernels that cater to the unique requirements of tensor contractions. Despite being entirely written in high-level Julia code and not yet exploiting a range of modern CUDA hardware features, the average performance of our library on standard tensor contractions compares favourably to cuTENSOR’s hand-optimised implementations, with outliers in both directions (faster and slower). When flexibility is needed, e.g. to fuse arbitrary elementwise operations into kernels, our library performs up to an order of magnitude faster than cuTENSOR, even on recent, data centre-grade devices such as the RTX 6000 Ada.
"Flexible Performant Tensor Contractions on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 787–804.
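The core GETT idea, flattening tensor modes so that a contraction becomes a plain GEMM, can be sketched with NumPy. This illustrates only the algorithmic reformulation, not the paper's Julia/GPU implementation: here the contracted modes (i, j) are already adjacent and ordered, so a reshape suffices; the general case also needs transpositions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 5, 6))   # modes (a, i, j)
B = rng.standard_normal((5, 6, 7))   # modes (i, j, b)

# GETT view: flatten the contracted modes (i, j) into one dimension of size
# 5*6 = 30, so the contraction C[a,b] = sum_{i,j} A[a,i,j] * B[i,j,b]
# becomes a plain (4 x 30) @ (30 x 7) matrix multiplication.
C_gemm = A.reshape(4, 30) @ B.reshape(30, 7)

# Reference contraction via einsum.
C_ref = np.einsum("aij,ijb->ab", A, B)

print(np.allclose(C_gemm, C_ref))  # True
```

Casting contractions as GEMMs is what lets a library reuse highly tuned matrix-multiplication kernels, which is the design point GemmKernels.jl is repurposed for in the paper.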
The transaction execution layer is a key determinant of throughput in permissioned blockchains. While recent Shared Memory Pools (SMP)-based approaches improve throughput by enabling all consensus nodes to participate in transaction packaging, they face two fundamental limitations. First, the performance bottleneck shifts from the consensus layer to the transaction execution layer as the number of transactions confirmed in a round increases. Second, these approaches are vulnerable to “transaction duplication” attacks where malicious clients can simultaneously send the same transaction to multiple consensus nodes, thereby decreasing the number of valid transactions in block proposals. To address these limitations, this paper introduces Nexus, a novel blockchain transaction processing framework with high scalability. Nexus leverages the idle computational resources of full nodes to enable transaction execution in parallel with the consensus. Moreover, Nexus allows each node to handle only a fraction of the total transactions and share execution results with others. This approach reduces overall transaction execution time, increases throughput, and decreases latency. Lastly, Nexus introduces a transaction partitioning mechanism that effectively addresses the “transaction duplication” attack and achieves load balancing between clients and consensus nodes. Our implementation of Nexus demonstrates significant improvements: throughput increases by 4x to 15x, and latency is reduced by 50% to 70%.
Shengjie Guan, Rongkai Zhang, Qiuyu Ding, Mingxuan Song, Zhen Xiao, Jieyi Long, Mingchao Wan, Taifu Yuan, Jin Dong, "Nexus: A Novel Transaction Processing Framework for Permissioned Blockchain," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 822–835. Pub Date: 2026-01-27 | DOI: 10.1109/TPDS.2026.3658222
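One simple way to realize a transaction partitioning mechanism of this kind is deterministic hash-based ownership, sketched below. This is a hypothetical illustration, not necessarily Nexus's exact rule: because every node computes the same owner for a transaction, a transaction duplicated to several nodes is packaged exactly once.

```python
import hashlib

def owner(tx: bytes, n_nodes: int) -> int:
    """Deterministic owner of a transaction: hash it, reduce modulo the
    number of consensus nodes. Every node computes the same answer."""
    digest = hashlib.sha256(tx).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

def accept(node_id: int, tx: bytes, n_nodes: int) -> bool:
    """A node packages a transaction only if it owns it, so duplicate
    submissions sent to the other nodes are simply dropped."""
    return owner(tx, n_nodes) == node_id

tx = b"transfer 10 coins from A to B, nonce=42"
n_nodes = 4
accepting = [node for node in range(n_nodes) if accept(node, tx, n_nodes)]
print(len(accepting))  # 1: the duplicated transaction is packaged by one node only
```

A side effect of hashing is statistical load balancing across nodes, which matches the load-balancing goal the abstract attributes to the partitioning mechanism.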
Pub Date: 2026-01-26 | DOI: 10.1109/TPDS.2026.3657795
Kartik Vishal Deshpande;Dheeraj Kumar;Osmar R Zaïane
Clustering spatio-temporal data in distributed systems is crucial for various applications such as traffic management, smart cities, telecommunications, and environmental monitoring. Despite the notable progress made in this field, several significant challenges persist: (a) in centralized systems, spatio-temporal data clustering necessitates that data be sent to the cloud for processing, which raises concerns about data transmission costs, latency, privacy, and security; (b) centralized systems incur high computational costs and require expensive hardware, resulting in prolonged algorithm runtimes; and (c) the lack of well-defined space- and time-contiguous clusters adversely affects the overall usability of the clusters produced. These challenges are addressed by the proposed dnccVAT algorithm for assessing clustering tendency in spatio-temporal data within distributed systems, which is part of the visual assessment of clustering tendency (VAT) family of algorithms. This algorithm effectively navigates the complexities associated with spatio-temporal relationships while minimizing communication overhead and ensuring scalability across distributed participant nodes.
Extensive experiments were carried out on six real-world datasets, one of them being high-dimensional Big Data, comparing the proposed method with four state-of-the-art spatio-temporal data clustering algorithms and evaluating seven different performance measures to provide valuable insights into the effectiveness of the proposed approach.
"dnccVAT: A Fully Distributed Approach for Clustering Tendency Assessment of IoT Generated Spatio-Temporal Data," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 762–774.
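For context, the classic centralized VAT reordering that the dnccVAT family builds on can be written compactly: points are reordered by a Prim-like traversal of the dissimilarity matrix so that clusters show up as dark diagonal blocks. This single-machine sketch does not capture the paper's contribution, which is performing the assessment across distributed nodes without shipping raw data.

```python
import numpy as np

def vat_order(D):
    """Prim-style reordering at the heart of VAT: start from an endpoint of
    the largest dissimilarity, then repeatedly append the unvisited point
    closest to the already-ordered set."""
    n = D.shape[0]
    order = [int(np.unravel_index(np.argmax(D), D.shape)[0])]
    remaining = set(range(n)) - set(order)
    while remaining:
        # next point: minimal dissimilarity to any already-ordered point
        nxt = min(remaining, key=lambda j: min(D[i, j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Two well-separated 1-D clusters; VAT keeps each cluster contiguous.
pts = np.array([0.0, 0.1, 0.2, 10.0, 10.1])
D = np.abs(pts[:, None] - pts[None, :])
print(vat_order(D))  # [0, 1, 2, 3, 4]: cluster members stay adjacent
```

Plotting `D[order][:, order]` as an image would then reveal one dark block per cluster, which is the "visual assessment" the family is named for.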
Pub Date: 2026-01-20 | DOI: 10.1109/TPDS.2026.3655842
Amir Hossein Ansari;Moein Esnaashari;Sepideh Safari;Mohsen Ansari;Alireza Ejlali;Jörg Henkel
With shrinking technology nodes and the integration of multiple cores on a single chip, the probability of fault occurrence has increased. These faults can be transient or permanent, and hybrid fault tolerance techniques have emerged as effective solutions for managing both types. In this paper, we propose a power-aware hybrid fault tolerance technique (called Pilot). Our approach utilizes checkpointing with rollback-recovery and primary/backup techniques, tolerating both kinds of faults. Moreover, in real-time embedded systems, power consumption is a critical constraint that must be managed. To do this, we exploit the Thermal Safe Power (TSP) constraint for each processing core. Based on this constraint and the utilization of each core, tasks are mapped and scheduled while guaranteeing the timing constraints. Our experimental results demonstrate that our proposed methods can meet the reliability target by tolerating the optimal number of fault occurrences in each task while reducing power consumption. Our proposed methods are compared to state-of-the-art techniques in terms of schedulability, power consumption, Quality of Service (QoS), energy consumption, and reliability.
The peak power and energy consumption are reduced on average by 34.2% and 15.9%, respectively, the QoS is improved on average by 28.7%, and the schedulability by 14.6%, while satisfying the system reliability target.
"Pilot: Power-Aware Hybrid Fault Tolerance in Multi-Core Embedded Systems," IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 3, pp. 726–743.
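The checkpointing-with-rollback component can be illustrated with a toy single-task simulation. The parameters and the cost model are illustrative only; the paper's scheduler additionally handles primary/backup copies, permanent faults, and the TSP power constraint.

```python
def execution_time(work, interval, ckpt_cost, fault_times):
    """Toy wall-clock simulation of checkpointing with rollback-recovery.

    A checkpoint (costing `ckpt_cost`) is taken after every `interval` units
    of useful work; a transient fault at any instant in `fault_times`
    discards the uncheckpointed progress since the last checkpoint.
    """
    t = 0.0          # wall clock
    done = 0.0       # checkpointed useful work
    progress = 0.0   # work since the last checkpoint
    faults = sorted(fault_times)
    while done + progress < work:
        step = min(interval - progress, work - done - progress)
        if faults and t + step > faults[0]:
            # a fault strikes before the segment completes: roll back
            t = max(t, faults.pop(0))
            progress = 0.0
            continue
        t += step
        progress += step
        if progress >= interval and done + progress < work:
            t += ckpt_cost  # take a checkpoint
            done += progress
            progress = 0.0
    return t

print(execution_time(work=10, interval=5, ckpt_cost=1, fault_times=[]))     # 11.0
print(execution_time(work=10, interval=5, ckpt_cost=1, fault_times=[7.0]))  # 12.0
```

The fault at time 7 only costs one extra time unit because the checkpoint at time 6 bounds the rollback, which is exactly the trade-off (checkpoint overhead versus re-execution cost) that checkpoint placement optimizes.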
Pub Date: 2026-01-19 | DOI: 10.1109/TPDS.2026.3654605
Dezhi Chen;Hongchuan He;Qi Qi;Jingyu Wang;Rongxin Han;Bo He;Zirui Zhuang;Qianlong Fu;Jianxin Liao;Zhu Han
Multi-agent cooperation is an open challenge in intelligent transportation systems (ITS). Traditional rule-based algorithms struggle to adapt to dynamic and uncertain environments, while learning-based algorithms are hindered by the scarcity and cost of labeled data. Reinforcement Learning (RL) offers a promising solution within ITS, as it allows for data acquisition through environmental interaction. However, our investigation has identified two primary issues when deploying RL-based algorithms: (1) The design of the reward function should strike a balance between the cooperative and competitive attributes of the system. Purely cooperative reward designs are challenging to learn due to delayed and sparse feedback, while individualized competitive reward designs may promote selfish behavior and rely heavily on expert knowledge. (2) Learning RL from scratch is also problematic due to the reliance of data generation on policy exploration. Pre-training can provide an initial model to circumvent learning difficulties, but its performance is constrained by the traditional algorithm that supplies the data, necessitating novel solutions to further improve model performance. In this paper, we introduce Hammurabi, a framework designed to enhance cooperation and improve the pre-training model within ITS. Hammurabi employs a social dilemma tool to assess the cooperative properties of the pre-trained policy and incorporates them into specific game models. Based on these game models, we can leverage existing mature conclusions from game theory to assist in the design of reinforcement learning, thereby enhancing agent cooperation. Theoretical analysis shows that by adopting a multi-agent reinforcement learning scheme with shared policy parameters, Hammurabi can converge multi-agent policies to a Nash equilibrium. We illustrate the application of Hammurabi in addressing practical issues within a multi-objective optimization multi-UAV system, demonstrating performance improvements across various optimization objectives compared to baseline algorithms.
{"title":"Hammurabi: Establish Cooperative Order From Pre-Trained Policies in Multi-UAV Networks","authors":"Dezhi Chen;Hongchuan He;Qi Qi;Jingyu Wang;Rongxin Han;Bo He;Zirui Zhuang;Qianlong Fu;Jianxin Liao;Zhu Han","doi":"10.1109/TPDS.2026.3654605","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3654605","url":null,"abstract":"Multi-agent cooperation is an open challenge in intelligent transportation systems (ITS). Traditional rule-based algorithms struggle to adapt to dynamic and uncertain environments, while learning-based algorithms are hindered by the scarcity and cost of labeled data. Reinforcement Learning (RL) offers a promising solution within ITS, as it allows for data acquisition through environmental interaction. However, our investigation has identified two primary issues when deploying RL-based algorithms: (1) The design of the reward function should strike a balance between the cooperative and competitive attributes of the system. Purely cooperative reward designs are challenging to learn due to delayed and sparse feedback, while individualized competitive reward designs may promote selfish behavior and rely heavily on expert knowledge. (2) Learning RL from scratch is also problematic due to the reliance of data generation on policy exploration. Pre-training can provide an initial model to circumvent learning difficulties, but its performance is constrained by the traditional algorithm that supplies the data, necessitating novel solutions to further improve model performance. In this paper, we introduce Hammurabi, a framework designed to enhance cooperation and improve the pre-training model within ITS. Hammurabi employs a social dilemma tool to assess the cooperative properties of the pre-trained policy and incorporates them into specific game models. Based on specific game models, we can leverage existing mature conclusions from game theory to assist in the design of reinforcement learning, thereby enhancing agent cooperation. 
Theoretical analysis shows that by adopting a multi-agent reinforcement learning scheme with shared policy parameters, Hammurabi converges multi-agent policies to a Nash equilibrium. We illustrate the application of Hammurabi to practical problems in a multi-objective multi-UAV optimization system, demonstrating performance improvements over baseline algorithms across a range of optimization objectives.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"744-761"},"PeriodicalIF":6.0,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
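The Hammurabi abstract above describes assessing a pre-trained policy's cooperative properties with a social-dilemma tool and mapping them onto specific game models. As a rough, self-contained illustration of that mapping step (the function name, payoff symbols, and classification thresholds below are our own assumptions drawn from standard game theory, not Hammurabi's implementation), a symmetric 2×2 game can be classified by the ordering of its four payoffs:

```python
def classify_game(R, S, T, P):
    """Classify a symmetric 2x2 game by its payoff ordering.

    R: reward for mutual cooperation, S: sucker's payoff,
    T: temptation to defect,         P: punishment for mutual defection.
    These are the textbook orderings, not values from the paper.
    """
    if T > R > P > S:
        return "prisoners_dilemma"  # defection dominates, yet mutual cooperation pays more
    if T > R > S > P:
        return "chicken"            # mutual defection is the worst outcome
    if R > T > P > S:
        return "stag_hunt"          # cooperation is payoff-dominant but risky
    return "no_dilemma"

# Payoffs would be estimated from the pre-trained policy's empirical
# returns; the numbers here are hypothetical:
print(classify_game(R=3, S=0, T=5, P=1))  # prisoners_dilemma
```

Once the game class is known, established equilibrium results for that class can guide the reward design, which is the role the abstract assigns to the game models.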
Pub Date : 2026-01-16DOI: 10.1109/TPDS.2026.3655025
Chun-Li Shao;Liu-Yun He;Pu Yang;Ze-Xia Huang;Guo-Yang Ye
Multinode cooperative systems with flexible grouping capabilities are a future development trend, as they adapt well to complex and dynamic mission requirements. To address the challenge of cooperative node selection in multinode cooperative localization, this study proposes an optimization algorithm for formation grouping in multinode cooperative localization based on the K-means algorithm and the wolf pack algorithm (WPA), referred to as K-WPA. The algorithm incorporates more practical constraints to guide multinode cluster grouping, thereby improving the efficiency of cluster grouping. In accordance with the clustering results, the population update process of the WPA is optimized to avoid convergence to local optima. The objective function of the WPA is designed using the Fisher information matrix, which is also used to evaluate the formation-grouping optimization process. Dynamic grouping simulations are conducted for cooperative systems with 20, 30, and 50 nodes. Results indicate that the proposed K-WPA method improves positioning accuracy by up to 41.24% compared to fixed grouping. Furthermore, the K-WPA algorithm combining space division and parallel grouping optimization maintains the average execution time within 1 s for the thousand-node swarm.
{"title":"Optimization Method Based on K-WPA for Multinode Cooperative Localization Formation Grouping","authors":"Chun-Li Shao;Liu-Yun He;Pu Yang;Ze-Xia Huang;Guo-Yang Ye","doi":"10.1109/TPDS.2026.3655025","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3655025","url":null,"abstract":"Multinode cooperative system with flexible grouping capabilities will become a future development trend to adapt well to the complex and dynamic mission requirements. To address the challenge of cooperative node selection in multinode cooperative localization, this study proposes an optimization algorithm for formation grouping in multinode cooperative localization based on the K-means algorithm and the wolf pack algorithm (WPA) (referred to as K-WPA). The algorithm incorporates more practical constraints to guide multinode cluster grouping, thereby improving the efficiency of cluster grouping. In accordance with the clustering results, the population update process of the WPA is optimized to avoid convergence to local optima. By using the Fisher information matrix, the objective function of the WPA is designed, and the optimization process of formation grouping is evaluated. Dynamic grouping simulations are conducted for cooperative systems with 20, 30, and 50 nodes. Results indicate that the proposed K-WPA method improves positioning accuracy by up to 41.24% compared to fixed grouping. 
Furthermore, the K-WPA algorithm combining space division and parallel grouping optimization maintains the average execution time within 1 s for the thousand-node swarm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"697-709"},"PeriodicalIF":6.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
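The K-WPA pipeline sketched in the abstract has two recognizable building blocks: K-means grouping of node positions, and a Fisher-information objective scoring a group's localization geometry. The sketch below is our own minimal reconstruction under simplifying assumptions (2-D positions, unit-variance range measurements); it is not the paper's K-WPA implementation and omits the wolf pack search itself.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means on 2-D node positions; returns a cluster index per node."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared distance
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

def fim_det(anchors, target):
    """Determinant of the 2x2 Fisher information matrix for range-based
    localization of `target` from `anchors`, assuming unit-variance range
    noise. Larger is better: it rewards anchors that surround the target."""
    a = b = c = 0.0
    for (x, y) in anchors:
        dx, dy = x - target[0], y - target[1]
        d = math.hypot(dx, dy) or 1e-9
        ux, uy = dx / d, dy / d  # unit bearing vector contributes u u^T
        a += ux * ux
        b += ux * uy
        c += uy * uy
    return a * c - b * b

# Two well-separated clusters of cooperative nodes (hypothetical positions):
nodes = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
groups = kmeans(nodes, k=2)
score = fim_det([(0, 0), (1, 0), (0, 1)], target=(0.3, 0.3))
print(groups, round(score, 3))
```

In a K-WPA-style optimizer, a score like `fim_det` would serve as the fitness that the wolf pack search maximizes over candidate groupings, with the K-means labels seeding the search population.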
Large language models (LLMs) have enabled transformative applications at the network edge, such as intelligent personal assistants. However, data privacy and security concerns necessitate a shift from cloud-centric paradigms to edge-based fine-tuning for personal LLMs. This transition is significantly hindered by intensive computational requirements and inherent resource scarcity, creating a “resource wall” that compromises training efficiency and feasibility. While current parameter-efficient fine-tuning (PEFT) and resource management strategies attempt to mitigate these constraints, they remain insufficient for the limited capacities of individual edge devices. To address these challenges, we propose PAC+, a resource-efficient collaborative edge AI framework for in-situ personal LLM fine-tuning. PAC+ overcomes the resource bottlenecks through a sophisticated algorithm-system codesign: (1) Algorithmically, PAC+ introduces a fine-tuning technique optimized for parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Furthermore, an activation cache mechanism streamlines the process by eliminating redundant forward passes across multiple epochs. (2) Systematically, PAC+ aggregates proximate edge devices into a collective resource pool, employing hybrid data and pipeline parallelism to orchestrate distributed training. By leveraging the activation cache, PAC+ enables the exclusive fine-tuning of Parallel Adapters via data parallelism, effectively bypassing the backbone's constraints. Extensive evaluation of the prototype implementation demonstrates that PAC+ significantly outperforms existing collaborative edge training systems, achieving up to a 9.7× end-to-end speedup. Furthermore, compared to mainstream LLM fine-tuning algorithms, PAC+ reduces memory footprint by up to 88.16%.
{"title":"Resource-Efficient Personal Large Language Models Fine-Tuning With Collaborative Edge Computing","authors":"Shengyuan Ye;Bei Ouyang;Tianyi Qian;Liekang Zeng;Jingyi Li;Jiangsu Du;Xiaowen Chu;Guoliang Xing;Xu Chen","doi":"10.1109/TPDS.2026.3654957","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3654957","url":null,"abstract":"Large language models (LLMs) have enabled transformative applications at the network edge, such as intelligent personal assistants. However, data privacy and security concerns necessitate a shift from cloud-centric paradigms to edge-based fine-tuning for personal LLMs. This transition is significantly hindered by intensive computational requirements and inherent resource scarcity, creating a “resource wall” that compromises training efficiency and feasibility. While current parameter-efficient fine-tuning (PEFT) and resource management strategies attempt to mitigate these constraints, they remain insufficient for the limited capacities of individual edge devices. To address these challenges, we propose <monospace>PAC+</monospace>, a resourceefficient collaborative edge AI framework for in-situ personal LLM fine-tuning. <monospace>PAC+</monospace> overcomes the resource bottlenecks through a sophisticated algorithm-system codesign: (1) Algorithmically, <monospace>PAC+</monospace> introduces a fine-tuning technique optimized for parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Furthermore, an activation cache mechanism streamlines the process by negating redundant forward passes across multiple epochs. (2) Systematically, <monospace>PAC+</monospace> aggregates proximate edge devices into a collective resource pool, employing hybrid data and pipeline parallelism to orchestrate distributed training. 
By leveraging the activation cache, <monospace>PAC+</monospace> enables the exclusive fine-tuning of Parallel Adapters via data parallelism, effectively bypassing the backbone's constraints. Extensive evaluation of the prototype implementation demonstrates that <monospace>PAC+</monospace> significantly outperforms existing collaborative edge training systems, achieving up to a 9.7× end-to-end speedup. Furthermore, compared to mainstream LLM fine-tuning algorithms, <monospace>PAC+</monospace> reduces memory footprint by up to 88.16%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"680-696"},"PeriodicalIF":6.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
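The activation-cache idea described in the PAC+ abstract (run the frozen backbone forward once, then reuse its outputs in later epochs so only the adapters see repeated work) can be sketched in a few lines. This is our own toy reconstruction: the backbone, adapter, loss, and update rule are placeholders, not PAC+'s API or training procedure.

```python
backbone_calls = 0

def frozen_backbone(x):
    """Stand-in for the frozen LLM backbone: expensive, never trained."""
    global backbone_calls
    backbone_calls += 1
    return [2.0 * v for v in x]  # pretend these are hidden activations

def adapter(h, w):
    """Tiny stand-in for a Parallel Adapter: one learnable scale per batch."""
    return [w * v for v in h]

samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cache = {}   # activation cache: sample index -> backbone output
w = 1.0      # the only trainable parameter in this sketch
for epoch in range(4):
    for i, x in enumerate(samples):
        if i not in cache:                # backbone runs only in epoch 0
            cache[i] = frozen_backbone(x)
        h = cache[i]
        y = adapter(h, w)
        # toy loss 0.5 * mean(y^2): gradient w.r.t. w is mean(y * h)
        grad = sum(v * hv for v, hv in zip(y, h)) / len(h)
        w -= 0.01 * grad                  # gradient step on the adapter only

print(backbone_calls)  # 3: one forward pass per sample, not per epoch
```

Because the cached activations never change (the backbone is frozen), the adapter parameters can also be trained purely data-parallel across devices, which is the system-level consequence the abstract draws.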