
Latest publications in IEEE Transactions on Parallel and Distributed Systems

Accuracy-Aware Mixed-Precision GPU Auto-Tuning
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-29 · DOI: 10.1109/TPDS.2026.3659324
Stijn Heldens;Ben van Werkhoven
Reduced-precision floating-point arithmetic has become increasingly important in GPU applications for AI and HPC, as it can deliver substantial speedups while reducing energy consumption and memory footprint. However, choosing the appropriate data formats brings a challenging tuning problem: precision parameters must be chosen to maximize performance while preserving numerical accuracy. At the same time, GPU kernels typically expose additional tunable optimization parameters, such as block size, tiling strategy, and vector width. The combination of these two kinds of parameters results in a complex trade-off between accuracy and performance, making manual exploration of the resulting design space time-consuming. In this work, we present an accuracy-aware extension to the open-source Kernel Tuner framework, enabling automatic tuning of floating-point precision parameters alongside conventional code-optimization parameters. We evaluate our accuracy-aware tuning solution on both Nvidia and AMD GPUs using a variety of kernels. Our results show speedups of up to $12\times$ over double precision, demonstrate how Kernel Tuner’s built-in search strategies are effective for accuracy-aware tuning, and show that our approach can be extended to other optimization objectives, such as memory footprint or energy efficiency. Moreover, we highlight that jointly tuning accuracy- and performance-affecting parameters outperforms isolated approaches in finding the best-performing configurations, despite significantly expanding the optimization space. This unified approach enables developers to trade accuracy for throughput systematically, enabling broader adoption of mixed-precision computing in scientific and industrial applications.
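The joint accuracy/performance search described above can be sketched as a toy exhaustive tuner. The runtime and error tables and the `tune` helper below are invented for illustration; they are not Kernel Tuner's API, which benchmarks real GPU kernels rather than looking up canned numbers.

```python
import itertools

# Hypothetical measurements for a toy kernel: runtime (ms) and numerical
# error per (precision, block_size) configuration. In a real setup these
# would come from benchmarking on the GPU; the values are made up.
RUNTIME = {
    ("fp64", 128): 10.0, ("fp64", 256): 9.1,
    ("fp32", 128): 5.2,  ("fp32", 256): 4.6,
    ("fp16", 128): 2.9,  ("fp16", 256): 2.4,
}
ERROR = {
    ("fp64", 128): 1e-15, ("fp64", 256): 1e-15,
    ("fp32", 128): 3e-7,  ("fp32", 256): 3e-7,
    ("fp16", 128): 2e-3,  ("fp16", 256): 2e-3,
}

def tune(error_budget):
    """Search the joint (precision, block_size) space and return the
    fastest configuration whose error stays within the budget."""
    space = itertools.product(["fp64", "fp32", "fp16"], [128, 256])
    valid = [cfg for cfg in space if ERROR[cfg] <= error_budget]
    return min(valid, key=lambda cfg: RUNTIME[cfg])

print(tune(1e-6))  # tight budget: a half-precision config is rejected
print(tune(1e-2))  # loose budget: the fastest low-precision config wins
```

The point of the joint search is visible even in this toy: the best precision choice depends on the block size actually measured, so tuning the two parameter kinds in isolation can miss the overall optimum.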
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 867–884. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11367475
Citations: 0
On the Performance of SMASH: A Non-Preemptive Window-Based Scheduler for Multiserver Jobs
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-28 · DOI: 10.1109/TPDS.2026.3657959
Diletta Olliaro;Sabina Rossi;Adityo Anggraito;Andrea Marin;Marco Ajmone Marsan
The efficient execution of data center jobs that require simultaneous use of different resource types is of critical importance. When processing capacity is the crucial resource for job execution, the term multiserver jobs is used, where server denotes the processors or CPU cores that provide processing capacity. Each multiserver job carries a requirement expressed as the number of servers it needs and its service duration. Achieving efficient execution of multiserver jobs relies heavily on effective scheduling of jobs on the available servers. Several schedulers have been proposed that aim to improve resource utilization at the cost of increased complexity. Because theoretical results on scheduler behavior for multiserver jobs are scarce, data center schedulers are often designed based only on managers’ experience. In this article, aiming to expand the understanding of multiserver job schedulers’ performance, we study Small Shuffle (SMASH) schedulers, a class of nonpreemptive, service-time-oblivious, window-based multiserver job scheduling algorithms that strike a balance between simplicity and efficient resource utilization, while allowing performance evaluation in simpler settings. SMASH implies only a marginal increase in complexity compared to FIFO, yet it delivers substantial performance improvements for multiserver jobs. Depending on the system parameters, SMASH can nearly double the system’s stability region with respect to FIFO, leading to significantly lower response times across a broad region of loads. Moreover, the magnitude of this improvement scales with the chosen window size, allowing performance to be tuned to the system’s operating conditions.
We first study the capacity of SMASH with analytical tools in simple settings, then we investigate the performance of SMASH and other schedulers with simulations under more realistic workloads, designed with parameters derived from measurements of real data centers. Results show that SMASH offers a very good compromise between performance and complexity.
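The benefit of a small scheduling window over plain FIFO can be shown with a minimal sketch, assuming jobs are (server demand, duration) pairs and that the scheduler may scan only the first few queue positions. This simplified window semantics is our own illustration, not the paper's exact SMASH definition.

```python
def window_schedule(queue, free, window):
    """Start every job found within the first `window` queue positions
    that fits in the free servers; window=1 degenerates to plain FIFO
    with head-of-line blocking. Jobs are (server_demand, duration)
    pairs. Returns (started jobs, remaining free servers)."""
    started = []
    i = 0
    while i < min(window, len(queue)):
        demand, duration = queue[i]
        if demand <= free:
            started.append(queue.pop(i))
            free -= demand
        else:
            i += 1
    return started, free

# 4 free servers; the head job needs 3, the next needs 4, the last needs 1.
jobs = [(3, 5), (4, 2), (1, 1)]
print(window_schedule(list(jobs), free=4, window=1))
print(window_schedule(list(jobs), free=4, window=3))
```

With window=1, the 4-server job at the head of the residual queue blocks the 1-server job behind it; a window of 3 lets the small job slip through and raises utilization, which is the intuition behind the larger stability region.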
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 966–981. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11364077
Citations: 0
Exploiting the Performance Potential of Extreme-Scale Earthquake Simulation: Achieving 86.7 PFLOPS With Over 39 Million Cores
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-28 · DOI: 10.1109/TPDS.2026.3658568
Lin Gan;Wubing Wan;Zekun Yin;Wenqiang Wang;Yilong Li;Zhenguo Zhang;Zhong He;Ping Gao;Xiaohui Duan;Weiguo Liu;Wei Xue;Haohuan Fu;Guangwen Yang;Xiaofei Chen
Leveraging the latest Sunway supercomputer, we developed a fully optimized earthquake simulation model that accurately captures topographic effects for realistic seismic analysis. Optimizing for the SW26010Pro architecture with DMA/RMA communication mechanisms, data compression schemes, and vectorization, we achieved a speedup exceeding 160×. Our pipeline-based computation and communication overlapping scheme, combined with performance prediction models, further minimized computational costs. These optimizations enabled the largest-scale curvilinear grid finite-difference method (CGFDM) earthquake simulations to date, covering 197 trillion grid points and achieving 86.7 PFLOPS on 39 million cores with a weak scaling efficiency of 97.9%. These advancements enabled the successful simulation of the 2008 Wenchuan earthquake, providing high-resolution seismic insights and robust assessments for regional hazard mitigation and disaster preparedness.
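For reference, weak-scaling efficiency keeps the work per core fixed, so ideal runtime is flat as cores are added and efficiency is the base runtime over the runtime at scale. The runtimes below are made up to match the 97.9% figure quoted in the abstract.

```python
def weak_scaling_efficiency(t_base, t_scaled):
    """Weak scaling: work per core fixed, ideal runtime constant, so
    efficiency = runtime at base scale / runtime at full scale."""
    return t_base / t_scaled

# Hypothetical runtimes (seconds) chosen to reproduce ~97.9%.
print(f"{weak_scaling_efficiency(100.0, 102.145) * 100:.1f}%")
```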
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 997–1014.
Citations: 0
Flexible Performant Tensor Contractions on GPUs
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-27 · DOI: 10.1109/TPDS.2026.3658162
Thomas Faingnaert;Ward Vermeulen;Tim Besard;Bjorn De Sutter
Tensor contractions extend the concept of the General Matrix Multiplication (GEMM) to high-dimensional spaces. They enable sophisticated computations in various scientific disciplines. Graphics Processing Units (GPUs) are commonly used to accelerate tensor contraction algorithms due to their inherent parallelisability. NVIDIA’s cuTENSOR stands as a state-of-the-art library for GPU-based tensor contractions. However, its lack of flexibility limits researchers in tailoring contraction kernels to their specific research needs. This paper presents a novel and flexible implementation of the GEMM-like Tensor Tensor (GETT) multiplication algorithm for tensor contractions in Julia. By repurposing and adapting components of GemmKernels.jl, a versatile library offering customisable and high-performance GEMM kernels for CUDA-enabled GPUs, we construct GEMM-like kernels that cater to the unique requirements of tensor contractions. Despite being entirely written in high-level Julia code and not yet exploiting a range of modern CUDA hardware features, the average performance of our library on standard tensor contractions compares favourably to cuTENSOR’s hand-optimised implementations, with outliers in both directions (faster and slower). When flexibility is needed, e.g. to fuse arbitrary elementwise operations into kernels, our library performs up to an order of magnitude faster than cuTENSOR, even on recent, data centre-grade devices such as the RTX 6000 Ada.
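What a GETT-style ("GEMM-like") kernel computes can be shown with NumPy: a tensor contraction is an einsum, and the GEMM-like view permutes and flattens the free and contracted index groups so the contraction becomes one plain matrix multiplication. The index labels and shapes here are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

# Contraction C[a,b] = sum_{c,d} A[a,c,d] * B[d,c,b]: indices c and d
# are contracted (summed), a and b remain free.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3, 4))   # indices (a, c, d)
B = rng.standard_normal((4, 3, 2))   # indices (d, c, b)

C = np.einsum("acd,dcb->ab", A, B)

# GETT view: flatten (c, d) into one contracted dimension k = c*4 + d
# on both sides, then the contraction is a single GEMM.
C_gemm = A.reshape(2, 12) @ B.transpose(1, 0, 2).reshape(12, 2)

print(np.allclose(C, C_gemm))  # True
```

The transpose on B is the data-layout work a GETT kernel performs implicitly while tiling, so the contracted index ordering matches on both operands before the matrix multiply.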
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 787–804.
Citations: 0
Nexus: A Novel Transaction Processing Framework for Permissioned Blockchain
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-27 · DOI: 10.1109/TPDS.2026.3658222
Shengjie Guan;Rongkai Zhang;Qiuyu Ding;Mingxuan Song;Zhen Xiao;Jieyi Long;Mingchao Wan;Taifu Yuan;Jin Dong
The transaction execution layer is a key determinant of throughput in permissioned blockchains. While recent Shared Memory Pools (SMP)-based approaches improve throughput by enabling all consensus nodes to participate in transaction packaging, they face two fundamental limitations. First, the performance bottleneck shifts from the consensus layer to the transaction execution layer as the number of transactions confirmed in a round increases. Second, these approaches are vulnerable to “transaction duplication” attacks, where malicious clients can simultaneously send the same transaction to multiple consensus nodes, thereby decreasing the number of valid transactions in block proposals. To address these limitations, this paper introduces Nexus, a novel blockchain transaction processing framework with high scalability. Nexus leverages the idle computational resources of full nodes to enable transaction execution in parallel with consensus. Moreover, Nexus allows each node to handle only a fraction of the total transactions and share execution results with others. This approach reduces overall transaction execution time, increases throughput, and decreases latency. Lastly, Nexus introduces a transaction partitioning mechanism that effectively addresses the “transaction duplication” attack and achieves load balancing between clients and consensus nodes. Our implementation of Nexus demonstrates significant improvements: throughput increases by 4× to 15×, and latency is reduced by 50% to 70%.
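One common way to realise the kind of deterministic transaction partitioning described above is hash-based ownership, so that a transaction duplicated across nodes is packaged by exactly one of them. This sketch is an assumption for illustration, not Nexus's actual mechanism.

```python
import hashlib

def owner(tx_id: str, n_nodes: int) -> int:
    """Deterministically map a transaction to exactly one consensus
    node. Every node applies the same rule, so a duplicate sent to
    several nodes is proposed only by its owner; the others discard
    it. Illustrative sketch, not Nexus's scheme."""
    digest = hashlib.sha256(tx_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_nodes

# A client spamming the same tx to all 4 nodes gains nothing:
# all nodes agree on the single owner.
print(owner("tx-42", 4))
```

Because the mapping depends only on the transaction identifier and the node count, it needs no coordination, which is why such schemes also give cheap load balancing when the hash is uniform.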
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 822–835.
Citations: 0
dnccVAT: A Fully Distributed Approach for Clustering Tendency Assessment of IoT Generated Spatio-Temporal Data
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-26 · DOI: 10.1109/TPDS.2026.3657795
Kartik Vishal Deshpande;Dheeraj Kumar;Osmar R Zaïane
Clustering spatio-temporal data in distributed systems is crucial for various applications such as traffic management, smart cities, telecommunications, and environmental monitoring. Despite the notable progress made in this field, several significant challenges persist: (a) in centralized systems, spatio-temporal data clustering necessitates that data be sent to the cloud for processing, which raises concerns about data transmission costs, latency, and privacy and security, (b) centralized systems incur high computational costs and require expensive hardware, resulting in prolonged runtime for algorithms, and (c) lack of well-defined space and time contiguous clusters adversely affects the overall usability of the clusters produced. These challenges are addressed by the proposed dnccVAT algorithm for assessing clustering tendency in spatio-temporal data within distributed systems, which is part of the visual assessment of clustering tendency family of algorithms. This algorithm effectively navigates the complexities associated with spatial-temporal relationships while minimizing communication overhead and ensuring scalability across distributed participant nodes. Extensive experiments were carried out on six real-world datasets, one of them being high-dimensional Big Data, comparing the proposed method with four state-of-the-art spatio-temporal data clustering algorithms and evaluating seven different performance measures to provide valuable insights into the effectiveness of the proposed approach.
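The visual assessment of clustering tendency (VAT) family that dnccVAT belongs to rests on a Prim-style reordering of the pairwise dissimilarity matrix: after reordering, dark diagonal blocks suggest cluster structure. Below is a minimal single-node sketch of that reordering; the distributed aspects that are the paper's contribution are omitted.

```python
import numpy as np

def vat_order(D):
    """Prim-style VAT reordering: start at an endpoint of the largest
    dissimilarity, then repeatedly append the unvisited point closest
    to the visited set. Viewing D[order][:, order] as an image then
    reveals diagonal blocks if clusters exist."""
    n = len(D)
    start = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order, remaining = [start], set(range(n)) - {start}
    while remaining:
        nxt = min(remaining, key=lambda j: min(D[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Two obvious clusters, {0, 1} and {2, 3}; the ordering keeps each
# cluster's members adjacent.
D = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.90],
              [0.90, 0.85, 0.00, 0.10],
              [0.80, 0.90, 0.10, 0.00]])
print(vat_order(D))
```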
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 4, pp. 762–774.
Citations: 0
Pilot: Power-Aware Hybrid Fault Tolerance in Multi-Core Embedded Systems
IF 6.0 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2026-01-20 · DOI: 10.1109/TPDS.2026.3655842
Amir Hossein Ansari;Moein Esnaashari;Sepideh Safari;Mohsen Ansari;Alireza Ejlali;Jörg Henkel
With the continued shrinking of technology nodes and the integration of multiple cores on a single chip, the probability of fault occurrence has increased. These faults can be transient or permanent, requiring techniques that manage both types. Hybrid fault tolerance techniques have emerged as effective solutions for handling both. In this paper, we propose a power-aware hybrid fault tolerance approach (called Pilot). Our approach utilizes checkpointing with rollback-recovery and primary/backup techniques, tolerating both kinds of faults. Moreover, in real-time embedded systems, power consumption is a critical constraint that must be managed. To do this, we exploit the Thermal Safe Power (TSP) constraint for each processing core. Based on this constraint and the utilization of each core, tasks are mapped and scheduled while guaranteeing the timing constraints. Our experimental results demonstrate that our proposed methods can meet the reliability target by tolerating the optimal number of fault occurrences in each task while reducing power consumption. Our proposed methods are compared to state-of-the-art techniques in terms of schedulability, power consumption, Quality of Service (QoS), energy consumption, and reliability.
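Mapping tasks to cores under a per-core power cap, as with a Thermal Safe Power budget, can be sketched with a first-fit-decreasing heuristic. The `map_tasks` helper, the task set, and the cap values are illustrative assumptions, not Pilot's actual algorithm.

```python
def map_tasks(tasks, n_cores, power_cap):
    """First-fit-decreasing mapping of (name, power) tasks onto cores
    so that no core's summed power exceeds its cap, a simplified
    stand-in for mapping under a per-core TSP budget. Returns a
    {task: core} placement, or None if no feasible mapping exists."""
    load = [0.0] * n_cores
    placement = {}
    for name, power in sorted(tasks, key=lambda t: -t[1]):  # big tasks first
        for core in range(n_cores):
            if load[core] + power <= power_cap:
                load[core] += power
                placement[name] = core
                break
        else:
            return None  # this task fits on no core
    return placement

tasks = [("t1", 3.0), ("t2", 2.5), ("t3", 2.0), ("t4", 1.5)]
print(map_tasks(tasks, n_cores=2, power_cap=5.0))
print(map_tasks(tasks, n_cores=1, power_cap=5.0))  # infeasible
```

A real scheme would additionally check timing constraints and account for checkpointing overhead before accepting a placement; here only the power dimension is modeled.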
Published in IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 3, pp. 726–743.
引用次数: 0
Hammurabi: Establish Cooperative Order From Pre-Trained Policies in Multi-UAV Networks 汉谟拉比:在多无人机网络中通过预先训练的政策建立合作秩序
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2026-01-19 DOI: 10.1109/TPDS.2026.3654605
Dezhi Chen;Hongchuan He;Qi Qi;Jingyu Wang;Rongxin Han;Bo He;Zirui Zhuang;Qianlong Fu;Jianxin Liao;Zhu Han
Multi-agent cooperation is an open challenge in intelligent transportation systems (ITS). Traditional rule-based algorithms struggle to adapt to dynamic and uncertain environments, while learning-based algorithms are hindered by the scarcity and cost of labeled data. Reinforcement Learning (RL) offers a promising solution within ITS, as it allows data to be acquired through environmental interaction. However, our investigation has identified two primary issues when deploying RL-based algorithms: (1) The design of the reward function must strike a balance between the cooperative and competitive attributes of the system. Purely cooperative reward designs are hard to learn from because their feedback is delayed and sparse, while individualized competitive reward designs may promote selfish behavior and rely heavily on expert knowledge. (2) Training RL policies from scratch is also problematic, because data generation depends on policy exploration. Pre-training can provide an initial model that circumvents these learning difficulties, but its performance is constrained by the traditional algorithm that supplies the data, necessitating novel solutions to further improve model performance. In this paper, we introduce Hammurabi, a framework designed to enhance cooperation and improve the pre-trained model within ITS. Hammurabi employs a social dilemma tool to assess the cooperative properties of the pre-trained policy and incorporates them into specific game models. Based on these game models, we can leverage mature conclusions from game theory to guide the design of the reinforcement learning, thereby enhancing agent cooperation. Theoretical analysis shows that, by adopting a multi-agent reinforcement learning scheme with shared policy parameters, Hammurabi converges multi-agent policies to a Nash equilibrium. We illustrate the application of Hammurabi to practical issues in a multi-objective multi-UAV optimization system, demonstrating performance improvements across various optimization objectives compared to baseline algorithms.
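Mapping a policy's cooperative properties onto a specific game model, as the abstract describes, typically comes down to the ordering of the four payoffs of a symmetric two-player game. A minimal illustrative classifier — the function name and the category boundaries here are this sketch's assumptions, not Hammurabi's API:

```python
def classify_2x2(R, S, T, P):
    """Classify a symmetric two-player game by its payoff ordering:
    R = reward for mutual cooperation, T = temptation to defect against
    a cooperator, S = sucker's payoff for cooperating against a defector,
    P = punishment for mutual defection."""
    if T > R > P > S:
        return "Prisoner's Dilemma"   # defection dominates, yet mutual defection is poor
    if T > R > S > P:
        return "Chicken"              # best reply is the opposite of the opponent's move
    if R > T > P > S:
        return "Stag Hunt"            # cooperation pays best but is risky
    if R >= T and S >= P:
        return "Harmony"              # cooperation dominates: no social dilemma
    return "Other"
```

Once the pre-trained policies' empirical payoffs place the system in one of these games, established game-theoretic results for that game can guide the reward design, as Hammurabi proposes.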
Optimization Method Based on K-WPA for Multinode Cooperative Localization Formation Grouping 基于K-WPA的多节点协同定位编队优化方法
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2026-01-16 DOI: 10.1109/TPDS.2026.3655025
Chun-Li Shao;Liu-Yun He;Pu Yang;Ze-Xia Huang;Guo-Yang Ye
Multinode cooperative systems with flexible grouping capabilities are a future development trend, as they adapt well to complex and dynamic mission requirements. To address the challenge of cooperative node selection in multinode cooperative localization, this study proposes an optimization algorithm for formation grouping based on the K-means algorithm and the wolf pack algorithm (WPA), referred to as K-WPA. The algorithm incorporates practical constraints to guide multinode cluster grouping, thereby improving grouping efficiency. In accordance with the clustering results, the population update process of the WPA is optimized to avoid convergence to local optima. The objective function of the WPA is designed using the Fisher information matrix, and the optimization process of formation grouping is evaluated. Dynamic grouping simulations are conducted for cooperative systems with 20, 30, and 50 nodes. The results indicate that the proposed K-WPA method improves positioning accuracy by up to 41.24% compared to fixed grouping. Furthermore, by combining space division with parallel grouping optimization, the K-WPA algorithm keeps the average execution time within 1 s for a thousand-node swarm.
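The clustering stage that seeds K-WPA can be sketched with plain K-means over node coordinates. This is a deterministic toy version (first-k seeding and the helper names are assumptions of this sketch); the paper's method additionally enforces practical constraints and refines the groups with the wolf pack algorithm, neither of which is shown here:

```python
def _d2(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_groups(points, k, iters=20):
    """Assign each node position to one of k groups with plain K-means.
    Deterministic for illustration: the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iters):
        # assignment step: each node joins its nearest centroid's group
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda j: _d2(p, centroids[j]))
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [points[i] for i, g in enumerate(assignment) if g == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assignment
```

For six nodes forming two spatial clusters, the assignment separates them after a single iteration; K-WPA would then score candidate groupings with a Fisher-information-based objective.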
Resource-Efficient Personal Large Language Models Fine-Tuning With Collaborative Edge Computing 资源高效的个人大型语言模型微调与协作边缘计算
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2026-01-16 DOI: 10.1109/TPDS.2026.3654957
Shengyuan Ye;Bei Ouyang;Tianyi Qian;Liekang Zeng;Jingyi Li;Jiangsu Du;Xiaowen Chu;Guoliang Xing;Xu Chen
Large language models (LLMs) have enabled transformative applications at the network edge, such as intelligent personal assistants. However, data privacy and security concerns necessitate a shift from cloud-centric paradigms to edge-based fine-tuning for personal LLMs. This transition is significantly hindered by intensive computational requirements and inherent resource scarcity, creating a “resource wall” that compromises training efficiency and feasibility. While current parameter-efficient fine-tuning (PEFT) and resource management strategies attempt to mitigate these constraints, they remain insufficient for the limited capacities of individual edge devices. To address these challenges, we propose PAC+, a resource-efficient collaborative edge AI framework for in-situ personal LLM fine-tuning. PAC+ overcomes the resource bottlenecks through a sophisticated algorithm-system co-design: (1) Algorithmically, PAC+ introduces a fine-tuning technique optimized for parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Furthermore, an activation cache mechanism streamlines the process by eliminating redundant forward passes across multiple epochs. (2) Systematically, PAC+ aggregates proximate edge devices into a collective resource pool, employing hybrid data and pipeline parallelism to orchestrate distributed training. By leveraging the activation cache, PAC+ enables the exclusive fine-tuning of Parallel Adapters via data parallelism, effectively bypassing the backbone's constraints. Extensive evaluation of the prototype implementation demonstrates that PAC+ significantly outperforms existing collaborative edge training systems, achieving up to a 9.7× end-to-end speedup. Furthermore, compared to mainstream LLM fine-tuning algorithms, PAC+ reduces memory footprint by up to 88.16%.
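The activation-cache idea — run the frozen backbone once per sample, then fine-tune only the parallel adapter over many epochs — can be sketched with a toy scalar model. `FrozenBackbone` and `finetune_adapter` are hypothetical names invented for this sketch; the real PAC+ operates on transformer backbones with hybrid data and pipeline parallelism:

```python
class FrozenBackbone:
    """Stand-in for the frozen LLM backbone: a fixed transform whose
    forward passes we count."""
    def __init__(self):
        self.calls = 0

    def forward(self, x):
        self.calls += 1
        return 2.0 * x + 1.0   # frozen weights: never updated

def finetune_adapter(backbone, xs, ys, epochs=100, lr=0.01):
    """Run the frozen backbone exactly once per sample, cache the
    activations, then fit a parallel scalar adapter  y_hat = h + w * x
    on the cache; no backbone pass is repeated across epochs."""
    cache = [backbone.forward(x) for x in xs]   # one forward pass per sample
    w = 0.0                                     # the only trainable parameter
    for _ in range(epochs):
        for x, h, y in zip(xs, cache, ys):
            pred = h + w * x                    # backbone output read from the cache
            w -= lr * 2.0 * (pred - y) * x      # SGD step on the adapter alone
    return w
```

With targets generated as y = backbone(x) + 0.5·x, the adapter weight converges to 0.5 while the backbone executes only len(xs) forward passes regardless of the epoch count — the essence of trading a little cache memory for repeated backbone computation.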