
IEEE Transactions on Parallel and Distributed Systems: Latest Articles

MEOCI: Model Partitioning and Early-Exit Point Selection Joint Optimization for Collaborative Inference in Vehicular Edge Computing
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-12 | DOI: 10.1109/TPDS.2026.3652171
Chunlin Li;Jiaqi Wang;Kun Jiang;Cheng Xiong;Shaohua Wan
In recent years, deep neural networks (DNNs) have been widely used in Vehicular Edge Computing (VEC), becoming the core technology for most intelligent applications. However, these DNN inference tasks are usually computation-intensive and latency-sensitive. In urban autonomous driving scenarios, when a large number of vehicles offload tasks to roadside units (RSUs), they face the problems of computational overload at edge servers and inference delays beyond tolerable limits. To address these challenges, we propose an edge-vehicle collaborative inference acceleration mechanism, namely Model partitioning and Early-exit point selection joint Optimization for Collaborative Inference (MEOCI). Specifically, we dynamically select the optimal model partitioning points under the constraints of RSU computing resources and vehicle computing capabilities, and choose the appropriate early-exit point according to a preset accuracy threshold. The goal is to minimize the average inference delay under the inference accuracy constraint. To this end, we propose the Adaptive Dual-Pool Dueling Double Deep Q-Network (ADP-D3QN) algorithm, which enhances the exploration strategy and experience replay mechanism of D3QN to implement the proposed optimization mechanism MEOCI. We conduct comprehensive performance evaluations using four DNN models: AlexNet, VGG16, ResNet50, and YOLOv10n. Experimental results show that the proposed ADP-D3QN algorithm reduces the average inference delay by 15.8% for AlexNet and 8.7% for VGG16 compared to the baseline algorithms.
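The joint decision the abstract describes can be illustrated with a simple exhaustive search over partition points and early-exit points. The per-layer costs, bandwidth, and accuracy numbers below are hypothetical placeholders, and the search itself stands in for the ADP-D3QN agent described in the paper; it is only a sketch of the optimization objective.

# Minimal sketch (assumed numbers): jointly pick a model partition point and an
# early-exit point that minimize estimated inference delay subject to an
# accuracy threshold. The paper solves this with a DRL agent (ADP-D3QN), not
# exhaustive search.
vehicle_ms = [4.0, 6.0, 8.0, 8.0, 5.0, 3.0]   # per-layer compute time on the vehicle
rsu_ms     = [1.0, 1.5, 2.0, 2.0, 1.2, 0.8]   # per-layer compute time on the RSU
feat_kb    = [800, 400, 200, 100, 50, 10]     # intermediate feature size after each layer

exits = [(3, 0.90), (4, 0.94), (5, 0.97)]     # candidate early-exit branches: (layer, accuracy)
uplink_kb_per_ms = 50.0                       # assumed vehicle-to-RSU bandwidth
acc_threshold = 0.93                          # required inference accuracy

def total_delay(partition, exit_layer):
    """Vehicle runs layers [0, partition); RSU runs [partition, exit_layer]."""
    on_vehicle = sum(vehicle_ms[:partition])
    upload = feat_kb[partition - 1] / uplink_kb_per_ms
    on_rsu = sum(rsu_ms[partition:exit_layer + 1])
    return on_vehicle + upload + on_rsu

best = min(
    ((p, e, total_delay(p, e))
     for p in range(1, len(vehicle_ms))
     for e, acc in exits
     if e >= p and acc >= acc_threshold),
    key=lambda t: t[2],
)
print("partition layer %d, exit layer %d, est. delay %.1f ms" % best)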
Citations: 0
Minimizing Communications of Quantum Circuit Simulations on Distributed Systems
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-12 | DOI: 10.1109/TPDS.2026.3652733
Longshan Xu;Edwin Hsing-Mean Sha;Yuhong Song;Yunfan Chi;Qingfeng Zhuge
Efficient full-state quantum circuit simulations are useful tools for the design of quantum algorithms. Multi-node distributed systems are commonly employed because such simulations require a large amount of computation power and memory space. In distributed systems, communication overhead can be the performance bottleneck. This paper presents a distributed simulation framework called QuanTrans. A quantum circuit is composed of many levels of quantum gates, and the simulation is conducted level by level. For circuits with particular structures, QuanTrans employs a hybrid simulation approach that replaces intermediate multi-level communications with one level of final merge operations, whose communication volume is comparable to that of one level of simulation in previous work. A circuit without such structures is sliced to find applicable sub-circuits spanning a single level or multiple consecutive levels. One level of communication is required for each sub-circuit, so we further propose a polynomial-time optimal circuit slicing algorithm that can transform any circuit such that the number of sliced sub-circuits after transformation is minimal. Experimental results show that QuanTrans effectively reduces communication time and simulation time.
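A heavily simplified view of the slicing idea: group consecutive gate levels into sub-circuits such that each sub-circuit touches only a bounded set of "global" qubits, so one communication round per sub-circuit suffices. The greedy loop below, the qubit threshold, and the budget are all assumptions for illustration; they are not the polynomial-time optimal algorithm from the paper.

# Toy sketch: each level is the set of qubit indices its gates act on; qubits
# >= LOCAL are "global" (their amplitudes live on other nodes), and a
# sub-circuit may involve at most MAX_GLOBAL of them before a communication
# step is required. Greedy slicing here is only an illustration.
LOCAL = 20        # qubits stored locally on each node (assumed)
MAX_GLOBAL = 2    # global qubits a sub-circuit may involve (assumed)

levels = [{1, 2}, {21, 3}, {22, 4}, {5, 6}, {23, 24, 7}, {8}]

def slice_circuit(levels):
    slices, current, globals_used = [], [], set()
    for lvl in levels:
        g = {q for q in lvl if q >= LOCAL}
        if len(globals_used | g) > MAX_GLOBAL and current:
            slices.append(current)          # close the slice: one comm round
            current, globals_used = [], set()
        current.append(lvl)
        globals_used |= g
    if current:
        slices.append(current)
    return slices

for i, s in enumerate(slice_circuit(levels)):
    print("sub-circuit", i, "levels:", s)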
Citations: 0
FairGFL: Privacy-Preserving Fairness-Aware Federated Learning With Overlapping Subgraphs
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-06 | DOI: 10.1109/TPDS.2025.3649863
Zihao Zhou;Shusen Yang;Fangyuan Zhao;Xuebin Ren
Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while preserving the privacy of raw data. However, graph data often exhibits overlap among different clients. Previous research has demonstrated certain benefits of overlapping data in mitigating data heterogeneity, but its negative effects have not been explored, particularly in cases where the overlaps are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced overlapping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-preserving manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach that enhances fairness across clients by leveraging privacy-preserving estimates of their overlap ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.
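The overlap-aware aggregation can be pictured as a weighted FedAvg in which clients with larger estimated overlap ratios are down-weighted, so duplicated subgraph data does not dominate the global model. The weighting rule below is a hypothetical stand-in for the interpretable scheme in the paper, and the numbers are invented.

import numpy as np

# Sketch of overlap-aware weighted aggregation (hypothetical weighting rule):
# clients whose subgraphs overlap more with others contribute proportionally
# less, so duplicated nodes/edges are not effectively counted twice.
def aggregate(client_updates, num_samples, overlap_ratio):
    """client_updates: list of parameter vectors; overlap_ratio in [0, 1)."""
    raw = np.array([n * (1.0 - r) for n, r in zip(num_samples, overlap_ratio)])
    weights = raw / raw.sum()
    return sum(w * u for w, u in zip(weights, client_updates))

updates = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.5, 0.5])]
samples = [1000, 800, 1200]
overlap = [0.6, 0.1, 0.3]   # in the paper these are privacy-preserving estimates
print(aggregate(updates, samples, overlap))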
Citations: 0
SMEStencil: Optimizing High-Order Stencils on ARM Multicore Using SME Unit
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-05 | DOI: 10.1109/TPDS.2025.3650515
Yinuo Wang;Tianqi Mao;Lin Gan;Wubing Wan;Zeyu Song;Jiayu Fu;Lanke He;Wenqiang Wang;Zekun Yin;Wei Xue;Guangwen Yang
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of the Scalable Matrix Extension (SME) on ARMv9-A CPUs, we analyze SME-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on the Scalable Vector Extension (SVE) and the SME unit to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. SMEStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, SMEStencil outperforms state-of-the-art libraries on the Nvidia A100 GPGPU by up to 2.1×. Moreover, the performance improvements enabled by our optimizations translate directly to real-world HPC applications, enabling Reverse Time Migration (RTM) applications to achieve a 1.8× speedup over a highly optimized Nvidia A100 GPGPU version.
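For reference, a plain 3D high-order stencil of the kind being accelerated looks like the NumPy sketch below; SMEStencil's contribution is mapping this access pattern onto SVE/SME tiles, not this naive formulation. The radius and coefficients are illustrative assumptions.

import numpy as np

# Naive radius-3 star stencil sweep in 3D: the baseline computation pattern
# that SMEStencil maps onto SVE/SME matrix tiles. Coefficients are made up.
R = 3
coef = np.array([0.5, 0.25, 0.15, 0.10])   # center weight, then one weight per offset

def stencil_step(u):
    out = np.zeros_like(u)
    core = (slice(R, -R),) * 3
    out[core] = coef[0] * u[core]
    for r in range(1, R + 1):
        for axis in range(3):
            out[core] += coef[r] * (np.roll(u, r, axis)[core] + np.roll(u, -r, axis)[core])
    return out

u = np.random.rand(32, 32, 32)
print(stencil_step(u).shape)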
Citations: 0
Rethinking Parameter Tuning in Distributed Storage Systems via Knowledge Graph Query
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-05 | DOI: 10.1109/TPDS.2025.3650593
Wang Zhang;Hongyu Wang;Zhan Shi;Yutong Wu;Mingjin Li;Tingfang Li;Fang Wang;Dan Feng
The growing volume of performance-critical parameters in distributed storage systems, coupled with diverse and dynamic workload patterns, has significantly increased the complexity of system configuration. These trends have expanded the parameter space while tightening the time window for tuning convergence, making it challenging to maintain high system performance. Existing tuning strategies often struggle to balance thorough parameter exploration with real-time responsiveness, limiting their effectiveness under fast-evolving workloads and heterogeneous deployment environments. To address these challenges, we propose KGQW, the first framework that formulates automated parameter tuning as a knowledge graph query workflow. KGQW models workload features and system parameters as graph vertices, with performance metrics represented as edges, and constructs an initial knowledge graph through lightweight performance tests. Guided by performance prediction and Bayesian-driven exploration, KGQW progressively expands the graph, prunes insensitive parameters, and refines performance relationships to build an informative and reusable knowledge graph that supports rapid configuration retrieval via graph querying. Moreover, KGQW enables efficient knowledge transfer across clusters, substantially reducing the construction cost for new clusters. Experiments on real-world applications and storage clusters demonstrate that KGQW achieves tuning latency on the order of seconds, while maintaining or surpassing the performance of state-of-the-art methods. These results highlight the promise of knowledge-driven tuning in meeting the scalability and adaptability demands of modern distributed storage systems.
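Conceptually, the retrieval step can be read as a nearest-neighbour lookup over workload-feature vertices whose edges carry measured performance for candidate configurations. The dictionary-based sketch below only illustrates that query step; the workload features, parameter strings, and numbers are invented and this is not KGQW's graph representation.

import math

# Toy configuration retrieval by graph query: workload vertices are feature
# vectors, and edges to parameter vertices carry measured throughput.
knowledge_graph = {
    # (read_ratio, avg_object_kb) -> {config string: throughput in MB/s}
    (0.9, 4):    {"cache=2g,threads=8": 950, "cache=1g,threads=16": 870},
    (0.2, 1024): {"cache=1g,threads=32": 610, "cache=4g,threads=8": 540},
}

def best_config(workload):
    """Find the most similar known workload vertex and return its best edge."""
    nearest = min(knowledge_graph, key=lambda w: math.dist(w, workload))
    configs = knowledge_graph[nearest]
    return max(configs, key=configs.get)

print(best_config((0.85, 8)))   # -> best config attached to the (0.9, 4) vertex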
Citations: 0
Optimizing Management of Persistent Data Structures in High-Performance Analytics
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-31 | DOI: 10.1109/TPDS.2025.3646133
Karim Youssef;Keita Iwabuchi;Maya Gokhale;Wu-chun Feng;Roger Pearce
Large-scale data analytics workflows ingest massive input data into various data structures, including graphs and key-value datastores. These data structures undergo multiple transformations and computations and are typically reused in incremental and iterative analytics workflows. Persisting in-memory views of these data structures enables reusing them beyond the scope of a single program run while avoiding repetitive raw data ingestion overheads. Memory-mapped I/O enables persisting in-memory data structures without data serialization and deserialization overheads. However, memory-mapped I/O lacks the key feature of persisting consistent snapshots of these data structures for incremental ingestion and processing. The obstacles to efficient virtual memory snapshots using memory-mapped I/O include background writebacks outside the application's control and the significantly high storage footprint of such snapshots. To address these limitations, we present Privateer, a memory and storage management tool that enables storage-efficient virtual memory snapshotting while also optimizing snapshot I/O performance. We integrated Privateer into Metall, a state-of-the-art persistent memory allocator for C++, and the Lightning Memory-Mapped Database (LMDB), a widely used key-value datastore in data analytics and machine learning. Privateer improves application performance by 1.22× when storing data structure snapshots to node-local storage, and by up to 16.7× when storing snapshots to a parallel file system. Privateer also improves the storage efficiency of incremental data structure snapshots by up to 11× using data deduplication and compression.
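The storage-efficiency idea can be pictured as content-addressed snapshot blocks: only blocks whose hash has not been seen before are written, so successive snapshots of a mostly unchanged memory image cost little extra space. The sketch below uses plain files and hashlib purely to show that flow; the paths, block size, and layout are assumptions and this is not Privateer's actual implementation.

import hashlib, os

# Sketch of deduplicated snapshotting: split a memory image into fixed-size
# blocks, store each unique block once under its content hash, and record a
# snapshot as the ordered list of block hashes.
BLOCK = 4096
STORE = "blocks"            # illustrative on-disk location
os.makedirs(STORE, exist_ok=True)

def snapshot(memory_image: bytes):
    manifest = []
    for off in range(0, len(memory_image), BLOCK):
        blk = memory_image[off:off + BLOCK]
        h = hashlib.sha256(blk).hexdigest()
        path = os.path.join(STORE, h)
        if not os.path.exists(path):        # deduplication: write each block once
            with open(path, "wb") as f:
                f.write(blk)
        manifest.append(h)
    return manifest

v1 = snapshot(b"A" * 8192 + b"B" * 4096)
v2 = snapshot(b"A" * 8192 + b"C" * 4096)    # reuses the "A" block stored by v1
print(len(set(v1 + v2)), "unique blocks stored")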
Citations: 0
Faster Vertex Cover Algorithms on GPUs With Component-Aware Parallel Branching
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-23 | DOI: 10.1109/TPDS.2025.3641049
Hussein Amro;Basel Fakhri;Amer E. Mouawad;Izzat El Hajj
Algorithms for finding minimum or bounded vertex covers in graphs use a branch-and-reduce strategy, which involves exploring a highly imbalanced search tree. Prior GPU solutions assign different thread blocks to different sub-trees, while using a shared worklist to balance the load. However, these prior solutions do not scale to large and complex graphs because their unawareness of when the graph splits into components causes them to solve these components redundantly. Moreover, their high memory footprint limits the number of workers that can execute concurrently. We propose a novel GPU solution for vertex cover problems that detects when a graph splits into components and branches on the components independently. Although the need to aggregate the solutions of different components introduces non-tail-recursive branches which interfere with load balancing, we overcome this challenge by delegating the post-processing to the last descendant of each branch. We also reduce the memory footprint by reducing the graph and inducing a subgraph before exploring the search tree. Our solution substantially outperforms the state-of-the-art GPU solution, finishing in seconds when the state-of-the-art solution exceeds 6 hours. To the best of our knowledge, our work is the first to parallelize non-tail-recursive branching patterns on GPUs in a load balanced manner.
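The component-aware idea can be seen in a tiny CPU sketch: before branching on a vertex, split the residual graph into connected components, solve each independently, and sum the results. The recursive code below illustrates that decomposition on a toy graph with a simplified branching rule; it is not the GPU worklist implementation described in the paper.

# CPU sketch of component-aware branch-and-reduce for minimum vertex cover.
def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def mvc(adj):
    adj = {v: set(n) for v, n in adj.items() if n}      # drop isolated vertices
    if not adj:
        return 0
    comps = components(adj)
    if len(comps) > 1:                                   # component-aware split
        return sum(mvc({v: adj[v] & c for v in c}) for c in comps)
    v = max(adj, key=lambda x: len(adj[x]))              # branch on a max-degree vertex
    without_v = {u: adj[u] - {v} for u in adj if u != v}                       # v in cover
    nbrs = adj[v]
    without_nbrs = {u: adj[u] - nbrs for u in adj if u != v and u not in nbrs} # N(v) in cover
    return min(1 + mvc(without_v), len(nbrs) + mvc(without_nbrs))

graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}   # triangle plus an edge
print(mvc(graph))   # -> 3 (2 for the triangle component + 1 for the edge)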
Citations: 0
Cost-Effective Empirical Performance Modeling
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-18 | DOI: 10.1109/TPDS.2025.3646119
Marcus Ritter;Benedikt Naumann;Alexandru Calotoiu;Sebastian Rinke;Thorsten Reimann;Torsten Hoefler;Felix Wolf
Performance models help us to understand how HPC applications scale, which is crucial for efficiently utilizing HPC resources. They describe the performance (e.g., runtime) as a function of one or more execution parameters (e.g., problem size and the degree of parallelism). Creating one manually for a given program is challenging and time-consuming. Automatically learning a model from performance data is a viable alternative, but potentially resource-intensive. Extra-P is a tool that implements this approach. The user begins by selecting values for each parameter. Each combination of values defines a possible measurement point. The choice of measurement points affects the quality and cost of the resulting models, creating a complex optimization problem. A naive approach takes measurements for all possible measurement points, the number of which grows exponentially with the number of parameters. In our earlier work, we demonstrated that a quasi-linear number of points is sufficient and that prioritizing the least expensive points is a generic strategy with a good trade-off between cost and quality. Here, we present an improved selection strategy based on Gaussian process regression (GPR) that selects points individually for each modeling task. In our synthetic evaluation, which was based on tens of thousands of artificially generated functions, the naive approach achieved 66% accuracy with two model parameters and 5% artificial noise. At only 10% of the naive approach's cost, the generic approach already achieved 47.3% accuracy, while the GPR-based approach achieved 77.8% accuracy. Similar improvements were observed in experiments involving different numbers of model parameters and noise levels, as well as in case studies with realistic applications.
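A minimal sketch of cost-aware point selection with Gaussian process regression: fit a GP to the measurements taken so far and pick the next configuration by balancing predictive uncertainty against measurement cost. It uses scikit-learn's GaussianProcessRegressor; the runtime function, candidate grid, and the uncertainty-per-cost acquisition rule are invented assumptions, not Extra-P's actual GPR strategy.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def runtime(p, n):                      # pretend application: runtime(procs, problem size)
    return 5.0 + 0.01 * n * np.log2(p) + np.random.normal(0, 0.05)

candidates = np.array([(p, n) for p in (2, 4, 8, 16, 32, 64)
                               for n in (100, 200, 400, 800)], dtype=float)
cost = candidates[:, 0] * candidates[:, 1]          # rough core-hours per measurement
measured_idx = [0, 5, 10]                           # cheap seed measurements
y = [runtime(*candidates[i]) for i in measured_idx]

for _ in range(5):                                  # select five more points
    gpr = GaussianProcessRegressor(normalize_y=True).fit(candidates[measured_idx], y)
    _, std = gpr.predict(candidates, return_std=True)
    score = std / np.sqrt(cost)                     # uncertainty gained per unit cost
    score[measured_idx] = -np.inf                   # never re-measure a point
    nxt = int(np.argmax(score))
    measured_idx.append(nxt)
    y.append(runtime(*candidates[nxt]))

print("selected points:", candidates[measured_idx].tolist())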
Citations: 0
Computational Burst Buffers: Accelerating HPC I/O via In-Storage Compression Offloading
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-11 | DOI: 10.1109/TPDS.2025.3643175
Xiang Chen;Bing Lu;Haoquan Long;Huizhang Luo;Yili Ma;Guangming Tan;Dingwen Tao;Fei Wu;Tao Lu
Burst buffers (BBs) act as an intermediate storage layer between compute nodes and parallel file systems (PFS), effectively alleviating the I/O performance gap in high-performance computing (HPC). As scientific simulations and AI workloads generate larger checkpoints and analysis outputs, BB capacity shortages and PFS bandwidth bottlenecks are emerging, and CPU-based compression is not an effective solution due to its high overhead. We introduce Computational Burst Buffers (CBBs), a storage paradigm that embeds hardware compression engines such as application-specific integrated circuit (ASIC) inside computational storage drives (CSDs) at the BB tier. CBB transparently offloads both lossless and error-bounded lossy compression from CPUs to CSDs, thereby (i) expanding effective SSD-backed BB capacity, (ii) reducing BB–PFS traffic, and (iii) eliminating contention and energy overheads of CPU-based compression. Unlike prior CSD-based compression designs targeting databases or flash caching, CBB co-designs the burst-buffer layer and CSD hardware for HPC and quantitatively evaluates compression offload in BB–PFS hierarchies. We prototype CBB using a PCIe 5.0 CSD with an ASIC Zstd-like compressor and an FPGA prototype of an SZ entropy encoder, and evaluate CBB on a 16-node cluster. Experiments with four representative HPC applications and a large-scale workflow simulator show up to 61% lower application runtime, 8–12× higher cache hit ratios, and substantially reduced compute-node CPU utilization compared to software compression and conventional BBs. These results demonstrate that compression-aware BBs with CSDs provide a practical, scalable path to next-generation HPC storage.
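From the application's point of view, the offload amounts to routing checkpoint writes through a compression path before they land on the burst-buffer tier, shrinking both BB footprint and later BB-to-PFS traffic. The sketch below emulates that flow on the host with zlib from the standard library; in CBB the compression runs inside the CSD, and the paths and payload here are illustrative.

import zlib, os, time

BB_DIR = "burst_buffer"                 # illustrative BB mount point
os.makedirs(BB_DIR, exist_ok=True)

def write_checkpoint(step: int, payload: bytes):
    t0 = time.perf_counter()
    compressed = zlib.compress(payload, level=1)     # fast, checkpoint-friendly setting
    with open(os.path.join(BB_DIR, f"ckpt_{step}.z"), "wb") as f:
        f.write(compressed)
    ratio = len(payload) / len(compressed)
    print(f"step {step}: {ratio:.1f}x smaller, {time.perf_counter() - t0:.3f}s")

write_checkpoint(0, b"\x00" * (8 << 20))             # highly compressible demo payload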
Citations: 0
A Survey on Machine Learning-Based HPC I/O Analysis and Optimization
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-03 | DOI: 10.1109/TPDS.2025.3639682
Jingxian Peng;Lihua Yang;Huijun Wu;Wenzhe Zhang;Zhenwei Wu;Wei Zhang;Jiaxin Li;Yiqin Dai;Yong Dong
The soaring computing power of HPC systems supports numerous large-scale applications, which generate massive data volumes and diverse I/O patterns, leading to severe I/O bottlenecks. Analyzing and optimizing HPC I/O is therefore critical. However, traditional approaches are typically customized and lack the adaptability required to cope with dynamic changes in HPC environments. To address the challenge, Machine Learning (ML) has been increasingly adopted to automate and enhance I/O analysis and optimization. Given sufficient I/O traces from HPC systems, ML can learn underlying I/O behaviors, extract actionable insights, and dynamically adapt to evolving workloads to improve performance. In this survey, we propose a novel taxonomy that aligns HPC I/O problems with learning tasks to systematically review existing studies. Through this taxonomy, we synthesize key findings on research distribution, data preparation, and model selection. Finally, we discuss several directions to advance the effective integration of ML in HPC I/O systems.
Citations: 0