Pub Date : 2026-01-16 | DOI: 10.1109/TPDS.2026.3655025
Chun-Li Shao;Liu-Yun He;Pu Yang;Ze-Xia Huang;Guo-Yang Ye
Multinode cooperative systems with flexible grouping capabilities are a future development trend because they adapt well to complex and dynamic mission requirements. To address the challenge of cooperative node selection in multinode cooperative localization, this study proposes an optimization algorithm for formation grouping based on the K-means algorithm and the wolf pack algorithm (WPA), referred to as K-WPA. The algorithm incorporates practical constraints to guide multinode cluster grouping, thereby improving grouping efficiency. Based on the clustering results, the population update process of the WPA is optimized to avoid convergence to local optima. The objective function of the WPA is designed using the Fisher information matrix, which is also used to evaluate the formation-grouping optimization process. Dynamic grouping simulations are conducted for cooperative systems with 20, 30, and 50 nodes. Results indicate that the proposed K-WPA method improves positioning accuracy by up to 41.24% compared to fixed grouping. Furthermore, by combining space division with parallel grouping optimization, the K-WPA algorithm keeps the average execution time within 1 s for a thousand-node swarm.
{"title":"Optimization Method Based on K-WPA for Multinode Cooperative Localization Formation Grouping","authors":"Chun-Li Shao;Liu-Yun He;Pu Yang;Ze-Xia Huang;Guo-Yang Ye","doi":"10.1109/TPDS.2026.3655025","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3655025","url":null,"abstract":"Multinode cooperative system with flexible grouping capabilities will become a future development trend to adapt well to the complex and dynamic mission requirements. To address the challenge of cooperative node selection in multinode cooperative localization, this study proposes an optimization algorithm for formation grouping in multinode cooperative localization based on the K-means algorithm and the wolf pack algorithm (WPA) (referred to as K-WPA). The algorithm incorporates more practical constraints to guide multinode cluster grouping, thereby improving the efficiency of cluster grouping. In accordance with the clustering results, the population update process of the WPA is optimized to avoid convergence to local optima. By using the Fisher information matrix, the objective function of the WPA is designed, and the optimization process of formation grouping is evaluated. Dynamic grouping simulations are conducted for cooperative systems with 20, 30, and 50 nodes. Results indicate that the proposed K-WPA method improves positioning accuracy by up to 41.24% compared to fixed grouping. Furthermore, the K-WPA algorithm combining space division and parallel grouping optimization maintains the average execution time within 1 s for the thousand-node swarm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"697-709"},"PeriodicalIF":6.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-16 | DOI: 10.1109/TPDS.2026.3654957
Shengyuan Ye;Bei Ouyang;Tianyi Qian;Liekang Zeng;Jingyi Li;Jiangsu Du;Xiaowen Chu;Guoliang Xing;Xu Chen
Large language models (LLMs) have enabled transformative applications at the network edge, such as intelligent personal assistants. However, data privacy and security concerns necessitate a shift from cloud-centric paradigms to edge-based fine-tuning for personal LLMs. This transition is significantly hindered by intensive computational requirements and inherent resource scarcity, creating a “resource wall” that compromises training efficiency and feasibility. While current parameter-efficient fine-tuning (PEFT) and resource management strategies attempt to mitigate these constraints, they remain insufficient for the limited capacities of individual edge devices. To address these challenges, we propose PAC+, a resource-efficient collaborative edge AI framework for in-situ personal LLM fine-tuning. PAC+ overcomes the resource bottlenecks through a sophisticated algorithm-system co-design: (1) Algorithmically, PAC+ introduces a fine-tuning technique optimized for parameters, time, and memory. It uses Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone, and an activation cache mechanism that eliminates redundant forward passes across multiple epochs. (2) At the system level, PAC+ aggregates proximate edge devices into a collective resource pool, employing hybrid data and pipeline parallelism to orchestrate distributed training. By leveraging the activation cache, PAC+ fine-tunes the Parallel Adapters exclusively via data parallelism, effectively bypassing the backbone's constraints. Extensive evaluation of the prototype implementation demonstrates that PAC+ significantly outperforms existing collaborative edge training systems, achieving up to a 9.7× end-to-end speedup. Furthermore, compared to mainstream LLM fine-tuning algorithms, PAC+ reduces memory footprint by up to 88.16%.
{"title":"Resource-Efficient Personal Large Language Models Fine-Tuning With Collaborative Edge Computing","authors":"Shengyuan Ye;Bei Ouyang;Tianyi Qian;Liekang Zeng;Jingyi Li;Jiangsu Du;Xiaowen Chu;Guoliang Xing;Xu Chen","doi":"10.1109/TPDS.2026.3654957","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3654957","url":null,"abstract":"Large language models (LLMs) have enabled transformative applications at the network edge, such as intelligent personal assistants. However, data privacy and security concerns necessitate a shift from cloud-centric paradigms to edge-based fine-tuning for personal LLMs. This transition is significantly hindered by intensive computational requirements and inherent resource scarcity, creating a “resource wall” that compromises training efficiency and feasibility. While current parameter-efficient fine-tuning (PEFT) and resource management strategies attempt to mitigate these constraints, they remain insufficient for the limited capacities of individual edge devices. To address these challenges, we propose <monospace>PAC+</monospace>, a resourceefficient collaborative edge AI framework for in-situ personal LLM fine-tuning. <monospace>PAC+</monospace> overcomes the resource bottlenecks through a sophisticated algorithm-system codesign: (1) Algorithmically, <monospace>PAC+</monospace> introduces a fine-tuning technique optimized for parameters, time, and memory. It utilizes Parallel Adapters to circumvent the need for a full backward pass through the LLM backbone. Furthermore, an activation cache mechanism streamlines the process by negating redundant forward passes across multiple epochs. (2) Systematically, <monospace>PAC+</monospace> aggregates proximate edge devices into a collective resource pool, employing hybrid data and pipeline parallelism to orchestrate distributed training. By leveraging the activation cache, <monospace>PAC+</monospace> enables the exclusive fine-tuning of Parallel Adapters via data parallelism, effectively bypassing the backbone's constraints. Extensive evaluation of the prototype implementation demonstrates that <monospace>PAC+</monospace> significantly outperforms existing collaborative edge training systems, achieving up to a 9.7× end-to-end speedup. Furthermore, compared to mainstream LLM fine-tuning algorithms, <monospace>PAC+</monospace> reduces memory footprint by up to 88.16%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"680-696"},"PeriodicalIF":6.0,"publicationDate":"2026-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-12 | DOI: 10.1109/TPDS.2026.3652171
Chunlin Li;Jiaqi Wang;Kun Jiang;Cheng Xiong;Shaohua Wan
In recent years, deep neural networks (DNNs) have been widely used in Vehicular Edge Computing (VEC), becoming the core technology for most intelligent applications. However, these DNN inference tasks are usually computation-intensive and latency-sensitive. In urban autonomous driving scenarios, when a large number of vehicles offload tasks to roadside units (RSUs), edge servers become computationally overloaded and inference delay exceeds tolerable limits. To address these challenges, we propose an edge-vehicle collaborative inference acceleration mechanism, namely Model partitioning and Early-exit point selection joint Optimization for Collaborative Inference (MEOCI). Specifically, we dynamically select the optimal model partitioning points under the constraints of RSU computing resources and vehicle computing capabilities, and choose the appropriate early-exit point according to a preset accuracy threshold. The goal is to minimize the average inference delay under the inference accuracy constraint. To this end, we propose the Adaptive Dual-Pool Dueling Double Deep Q-Network (ADP-D3QN) algorithm, which enhances the exploration strategy and experience replay mechanism of D3QN to implement the proposed optimization mechanism MEOCI. We conduct comprehensive performance evaluations using four DNN models: AlexNet, VGG16, ResNet50, and YOLOv10n. Experimental results show the proposed ADP-D3QN algorithm reduces average inference delay by 15.8% for AlexNet and 8.7% for VGG16 compared to the baseline algorithm.
{"title":"MEOCI: Model Partitioning and Early-Exit Point Selection Joint Optimization for Collaborative Inference in Vehicular Edge Computing","authors":"Chunlin Li;Jiaqi Wang;Kun Jiang;Cheng Xiong;Shaohua Wan","doi":"10.1109/TPDS.2026.3652171","DOIUrl":"https://doi.org/10.1109/TPDS.2026.3652171","url":null,"abstract":"In recent years, deep neural networks (DNNs) have been widely used in Vehicular Edge Computing (VEC), becoming the core technology for most intelligent applications. However, these DNN inference tasks are usually computation-intensive and latency-sensitive. In urban autonomous driving scenarios, when a large number of vehicles offload tasks to roadside units (RSUs), they face the problem of computational overload of edge servers and inference delay beyond tolerable limits. To address these challenges, we propose an edge-vehicle collaborative inference acceleration mechanism, namely Model partitioning and Early-exit point selection joint Optimization for Collaborative Inference (MEOCI). Specifically, we dynamically select the optimal model partitioning points with the constraint of RSU computing resources and vehicle computing capabilities; and according to the accuracy threshold set to choose the appropriate early exit point. The goal is to minimize the average inference delay under the inference accuracy constraint. Therefore, we propose the Adaptive Dual-Pool Dueling Double Deep Q-Network (ADP-D3QN) algorithm, which enhances the exploration strategy and experience replay mechanism of D3QN to implement the proposed optimization mechanism MEOCI. We conduct comprehensive performance evaluations using four DNN models: AlexNet, VGG16, ResNet50, YOLOv10n. Experimental results show the proposed ADP-D3QN algorithm reduces average inference delay by 15.8% for AlexNet and 8.7% for VGG16 compared to baseline algorithm.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"666-679"},"PeriodicalIF":6.0,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-06 | DOI: 10.1109/TPDS.2025.3649863
Zihao Zhou;Shusen Yang;Fangyuan Zhao;Xuebin Ren
Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while preserving the privacy of raw data. In practice, however, graph data often overlaps across different clients. Previous research has demonstrated certain benefits of overlapping data in mitigating data heterogeneity, but its negative effects have not been explored, particularly when the overlaps are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced overlapping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-preserving manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach to enhance fairness across clients, leveraging privacy-preserving estimation of their overlapping ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.
{"title":"FairGFL: Privacy-Preserving Fairness-Aware Federated Learning With Overlapping Subgraphs","authors":"Zihao Zhou;Shusen Yang;Fangyuan Zhao;Xuebin Ren","doi":"10.1109/TPDS.2025.3649863","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3649863","url":null,"abstract":"Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while preserving the privacy of raw data. However, graph data often exhibits overlap among different clients. Previous research has demonstrated certain benefits of overlapping data in mitigating data heterogeneity. However, the negative effects have not been explored, particularly in cases where the overlaps are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced overlapping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-preserving manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach to enhance fairness across clients, leveraging privacy-preserving estimation of their overlapping ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"710-725"},"PeriodicalIF":6.0,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-05 | DOI: 10.1109/TPDS.2025.3650515
Yinuo Wang;Tianqi Mao;Lin Gan;Wubing Wan;Zeyu Song;Jiayu Fu;Lanke He;Wenqiang Wang;Zekun Yin;Wei Xue;Guangwen Yang
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of the Scalable Matrix Extension (SME) on ARMv9-A CPUs, we analyze SME-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on the Scalable Vector Extension (SVE) and the SME unit to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. SMEStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all these innovations, SMEStencil outperforms state-of-the-art libraries on the Nvidia A100 GPGPU by up to 2.1×. Moreover, the performance improvements enabled by our optimizations translate directly to real-world HPC applications, allowing a real-world Reverse Time Migration (RTM) application to achieve a 1.8× speedup over a highly optimized Nvidia A100 GPGPU version.
{"title":"SMEStencil: Optimizing High-Order Stencils on ARM Multicore Using SME Unit","authors":"Yinuo Wang;Tianqi Mao;Lin Gan;Wubing Wan;Zeyu Song;Jiayu Fu;Lanke He;Wenqiang Wang;Zekun Yin;Wei Xue;Guangwen Yang","doi":"10.1109/TPDS.2025.3650515","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3650515","url":null,"abstract":"Matrix-accelerated stencil computation is a hot research topic, yet its application to 3 dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of Scalable Matrix Extension(SME) on ARMv9-A CPU, we analyze SME-based accelerating strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on Scalable Vector Extension(SVE) and SME unit to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. SMEStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, SMEStencil outperforms state-of-the-art libraries on Nividia A100 GPGPU by up to 2.1×. Moreover, the performance improvements enabled by our optimizations translate directly to real-world HPC applications and enable Reverse Time Migration(RTM) real-world applications to yield 1.8x speedup versus highly-optimized Nvidia A100 GPGPU version.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"651-665"},"PeriodicalIF":6.0,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-05 | DOI: 10.1109/TPDS.2025.3650593
Wang Zhang;Hongyu Wang;Zhan Shi;Yutong Wu;Mingjin Li;Tingfang Li;Fang Wang;Dan Feng
The growing volume of performance-critical parameters in distributed storage systems, coupled with diverse and dynamic workload patterns, has significantly increased the complexity of system configuration. These trends have expanded the parameter space while tightening the time window for tuning convergence, making it challenging to maintain high system performance. Existing tuning strategies often struggle to balance thorough parameter exploration with real-time responsiveness, limiting their effectiveness under fast-evolving workloads and heterogeneous deployment environments. To address these challenges, we propose KGQW, the first framework that formulates automated parameter tuning as a knowledge graph query workflow. KGQW models workload features and system parameters as graph vertices, with performance metrics represented as edges, and constructs an initial knowledge graph through lightweight performance tests. Guided by performance prediction and Bayesian-driven exploration, KGQW progressively expands the graph, prunes insensitive parameters, and refines performance relationships to build an informative and reusable knowledge graph that supports rapid configuration retrieval via graph querying. Moreover, KGQW enables efficient knowledge transfer across clusters, substantially reducing the construction cost for new clusters. Experiments on real-world applications and storage clusters demonstrate that KGQW achieves second-level tuning latency, while maintaining or surpassing the performance of state-of-the-art methods. These results highlight the promise of knowledge-driven tuning in meeting the scalability and adaptability demands of modern distributed storage systems.
{"title":"Rethinking Parameter Tuning in Distributed Storage Systems via Knowledge Graph Query","authors":"Wang Zhang;Hongyu Wang;Zhan Shi;Yutong Wu;Mingjin Li;Tingfang Li;Fang Wang;Dan Feng","doi":"10.1109/TPDS.2025.3650593","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3650593","url":null,"abstract":"The growing volume of performance-critical parameters in distributed storage systems, coupled with diverse and dynamic workload patterns, has significantly increased the complexity of system configuration. These trends have expanded the parameter space while tightening the time window for tuning convergence, making it challenging to maintain high system performance. Existing tuning strategies often struggle to balance thorough parameter exploration with real-time responsiveness, limiting their effectiveness under fast-evolving workloads and heterogeneous deployment environments. To address these challenges, we propose KGQW, the first framework that formulates automated parameter tuning as a knowledge graph query workflow. KGQW models workload features and system parameters as graph vertices, with performance metrics represented as edges, and constructs an initial knowledge graph through lightweight performance tests. Guided by performance prediction and Bayesian-driven exploration, KGQW progressively expands the graph, prunes insensitive parameters, and refines performance relationships to build an informative and reusable knowledge graph that supports rapid configuration retrieval via graph querying. Moreover, KGQW enables efficient knowledge transfer across clusters, substantially reducing the construction cost for new clusters. Experiments on real-world applications and storage clusters demonstrate that KGQW achieves second-level tuning latency, while maintaining or surpassing the performance of state-of-the-art methods. These results highlight the promise of knowledge-driven tuning in meeting the scalability and adaptability demands of modern distributed storage systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"633-650"},"PeriodicalIF":6.0,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-31 | DOI: 10.1109/TPDS.2025.3646133
Karim Youssef;Keita Iwabuchi;Maya Gokhale;Wu-chun Feng;Roger Pearce
Large-scale data analytics workflows ingest massive input data into various data structures, including graphs and key-value datastores. These data structures undergo multiple transformations and computations and are typically reused in incremental and iterative analytics workflows. Persisting in-memory views of these data structures enables reusing them beyond the scope of a single program run while avoiding repetitive raw data ingestion overheads. Memory-mapped I/O enables persisting in-memory data structures without data serialization and deserialization overheads. However, memory-mapped I/O lacks a key feature: persisting consistent snapshots of these data structures for incremental ingestion and processing. The obstacles to efficient virtual memory snapshots using memory-mapped I/O include background writebacks outside the application's control and the high storage footprint of such snapshots. To address these limitations, we present Privateer, a memory and storage management tool that enables storage-efficient virtual memory snapshotting while also optimizing snapshot I/O performance. We integrated Privateer into Metall, a state-of-the-art persistent memory allocator for C++, and the Lightning Memory-Mapped Database (LMDB), a widely used key-value datastore in data analytics and machine learning. Privateer improves application performance by 1.22× when storing data structure snapshots to node-local storage, and by up to 16.7× when storing snapshots to a parallel file system. Privateer also improves the storage efficiency of incremental data structure snapshots by up to 11× using data deduplication and compression.
{"title":"Optimizing Management of Persistent Data Structures in High-Performance Analytics","authors":"Karim Youssef;Keita Iwabuchi;Maya Gokhale;Wu-chun Feng;Roger Pearce","doi":"10.1109/TPDS.2025.3646133","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3646133","url":null,"abstract":"Large-scale data analytics workflows ingest massive input data into various data structures, including graphs and key-value datastores. These data structures undergo multiple transformations and computations and are typically reused in incremental and iterative analytics workflows. Persisting in-memory views of these data structures enables reusing them beyond the scope of a single program run while avoiding repetitive raw data ingestion overheads. Memory-mapped I/O enables persisting in-memory data structures without data serialization and deserialization overheads. However, memory-mapped I/O lacks the key feature of persisting consistent snapshots of these data structures for incremental ingestion and processing. The obstacles to efficient virtual memory snapshots using memory-mapped I/O include background writebacks outside the application’s control, and the significantly high storage footprint of such snapshots. To address these limitations, we present <italic>Privateer</i>, a memory and storage management tool that enables storage-efficient virtual memory snapshotting while also optimizing snapshot I/O performance. We integrated <italic>Privateer</i> into <italic>Metall</i>, a state-of-the-art persistent memory allocator for C++, and the Lightning Memory-Mapped Database (LMDB), a widely-used key-value datastore in data analytics and machine learning. <italic>Privateer</i> optimized application performance by 1.22× when storing data structure snapshots to node-local storage, and up to 16.7× when storing snapshots to a parallel file system. <italic>Privateer</i> also optimizes storage efficiency of incremental data structure snapshots by up to 11× using data deduplication and compression.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"562-574"},"PeriodicalIF":6.0,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-23 | DOI: 10.1109/TPDS.2025.3641049
Hussein Amro;Basel Fakhri;Amer E. Mouawad;Izzat El Hajj
Algorithms for finding minimum or bounded vertex covers in graphs use a branch-and-reduce strategy, which involves exploring a highly imbalanced search tree. Prior GPU solutions assign different thread blocks to different sub-trees, while using a shared worklist to balance the load. However, these prior solutions do not scale to large and complex graphs because they are unaware of when the graph splits into components and therefore solve those components redundantly. Moreover, their high memory footprint limits the number of workers that can execute concurrently. We propose a novel GPU solution for vertex cover problems that detects when a graph splits into components and branches on the components independently. Although the need to aggregate the solutions of different components introduces non-tail-recursive branches that interfere with load balancing, we overcome this challenge by delegating the post-processing to the last descendant of each branch. We also reduce the memory footprint by reducing the graph and inducing a subgraph before exploring the search tree. Our solution substantially outperforms the state-of-the-art GPU solution, finishing in seconds where the state-of-the-art solution exceeds 6 hours. To the best of our knowledge, our work is the first to parallelize non-tail-recursive branching patterns on GPUs in a load-balanced manner.
{"title":"Faster Vertex Cover Algorithms on GPUs With Component-Aware Parallel Branching","authors":"Hussein Amro;Basel Fakhri;Amer E. Mouawad;Izzat El Hajj","doi":"10.1109/TPDS.2025.3641049","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3641049","url":null,"abstract":"Algorithms for finding minimum or bounded vertex covers in graphs use a branch-and-reduce strategy, which involves exploring a highly imbalanced search tree. Prior GPU solutions assign different thread blocks to different sub-trees, while using a shared worklist to balance the load. However, these prior solutions do not scale to large and complex graphs because their unawareness of when the graph splits into components causes them to solve these components redundantly. Moreover, their high memory footprint limits the number of workers that can execute concurrently. We propose a novel GPU solution for vertex cover problems that detects when a graph splits into components and branches on the components independently. Although the need to aggregate the solutions of different components introduces non-tail-recursive branches which interfere with load balancing, we overcome this challenge by delegating the post-processing to the last descendant of each branch. We also reduce the memory footprint by reducing the graph and inducing a subgraph before exploring the search tree. Our solution substantially outperforms the state-of-the-art GPU solution, finishing in seconds when the state-of-the-art solution exceeds 6 hours. To the best of our knowledge, our work is the first to parallelize non-tail-recursive branching patterns on GPUs in a load balanced manner.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"504-517"},"PeriodicalIF":6.0,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-18 | DOI: 10.1109/TPDS.2025.3646119
Marcus Ritter;Benedikt Naumann;Alexandru Calotoiu;Sebastian Rinke;Thorsten Reimann;Torsten Hoefler;Felix Wolf
Performance models help us to understand how HPC applications scale, which is crucial for efficiently utilizing HPC resources. They describe performance (e.g., runtime) as a function of one or more execution parameters (e.g., problem size and the degree of parallelism). Creating such a model manually for a given program is challenging and time-consuming. Automatically learning a model from performance data is a viable alternative, but potentially resource-intensive. Extra-P is a tool that implements this approach. The user begins by selecting values for each parameter. Each combination of values defines a possible measurement point. The choice of measurement points affects the quality and cost of the resulting models, creating a complex optimization problem. A naive approach takes measurements for all possible measurement points, the number of which grows exponentially with the number of parameters. In our earlier work, we demonstrated that a quasi-linear number of points is sufficient and that prioritizing the least expensive points is a generic strategy with a good trade-off between cost and quality. Here, we present an improved selection strategy based on Gaussian process regression (GPR) that selects points individually for each modeling task. In our synthetic evaluation, which was based on tens of thousands of artificially generated functions, the naive approach achieved 66% accuracy with two model parameters and 5% artificial noise. At only 10% of the naive approach's cost, the generic approach already achieved 47.3% accuracy, while the GPR-based approach achieved 77.8% accuracy. Similar improvements were observed in experiments involving different numbers of model parameters and noise levels, as well as in case studies with realistic applications.
{"title":"Cost-Effective Empirical Performance Modeling","authors":"Marcus Ritter;Benedikt Naumann;Alexandru Calotoiu;Sebastian Rinke;Thorsten Reimann;Torsten Hoefler;Felix Wolf","doi":"10.1109/TPDS.2025.3646119","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3646119","url":null,"abstract":"Performance models help us to understand how HPC applications scale, which is crucial for efficiently utilizing HPC resources. They describe the performance (e.g., runtime) as a function of one or more execution parameters (e.g., problem size and the degree of parallelism). Creating one manually for a given program is challenging and time-consuming. Automatically learning a model from performance data is a viable alternative, but potentially resource-intensive. Extra-P is a tool that implements this approach. The user begins by selecting values for each parameter. Each combination of values defines a possible measurement point. The choice of measurement points affects the quality and cost of the resulting models, creating a complex optimization problem. A naive approach takes measurements for all possible measurement points, the number of which grows exponentially with the number of parameters. In our earlier work, we demonstrated that a quasi-linear number of points is sufficient and that prioritizing the least expensive points is a generic strategy with a good trade-off between cost and quality. Here, we present an improved selection strategy based on Gaussian process regression (GPR) that selects points individually for each modeling task. In our synthetic evaluation, which was based on tens of thousands of artificially generated functions, the naive approach achieved 66% accuracy with two model parameters and 5% artificial noise. At only 10% of the naïve approach’s cost, the generic approach already achieved 47.3% accuracy, while the GPR-based approach achieved even 77.8% accuracy. Similar improvements were observed in experiments involving different numbers of model parameters and noise levels, as well as in case studies with realistic applications.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"575-592"},"PeriodicalIF":6.0,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}