
Latest publications from IEEE Transactions on Parallel and Distributed Systems

Fully Decentralized Data Distribution for Large-Scale HPC Systems
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-17 DOI: 10.1109/TPDS.2025.3633298
Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu
For many years, in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have had to increase the number of data providers to improve I/O parallelism to match the data demanders. In large-scale, especially exascale HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and offer the best scalability; we call this the all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on the computing networks of HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and relieve the tracker load with a neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, the performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for exascale HPC systems.
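The HLF policy named above orders outstanding requests by piece ranking and by the length of the consecutive run of missing pieces. The minimal Python sketch below illustrates that selection rule in isolation; the ranking source, tie-breaking, and function names are assumptions for illustration, not FD3's actual implementation.

```python
from typing import List, Tuple

def select_segment(have: List[bool], rank: List[int]) -> Tuple[int, int]:
    """Return (start_index, length) of the missing piece segment to request next.

    have[i] -- True if piece i has already been validated locally.
    rank[i] -- integer priority of piece i (higher = more urgent), e.g. how
               rare the piece currently is among neighbouring nodes.
    """
    segments = []                       # (best_rank_in_segment, length, start)
    i, n = 0, len(have)
    while i < n:
        if have[i]:
            i += 1
            continue
        start = i
        while i < n and not have[i]:    # grow the consecutive missing run
            i += 1
        segments.append((max(rank[start:i]), i - start, start))
    if not segments:
        return (-1, 0)                  # nothing left to request
    best = max(segments, key=lambda s: (s[0], s[1]))   # rank first, then length
    return (best[2], best[1])

if __name__ == "__main__":
    have = [True, False, False, True, False, False, False, True]
    rank = [0, 3, 3, 0, 2, 2, 2, 0]
    print(select_segment(have, rank))   # -> (1, 2): higher rank beats the longer run
```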
{"title":"Fully Decentralized Data Distribution for Large-Scale HPC Systems","authors":"Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu","doi":"10.1109/TPDS.2025.3633298","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3633298","url":null,"abstract":"For many years, in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have to increase the number of data providers to improve the IO parallelism to match the data demanders. In large-scale, especially exascale HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and have the best scalability, which is called all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on computing networks in HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and release the tracker load with neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, the performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for Exscale HPC systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"304-321"},"PeriodicalIF":6.0,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-12 DOI: 10.1109/TPDS.2025.3632073
Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li
Graph-structured data is widely used in transportation, molecular, and e-commerce networks, among other domains. Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multi-dataflow and high bandwidth memory (HBM), named DAHBM-GCN. First, we design a computing engine that supports multiple dataflows, covering both aggregation-first and combination-first orders. Furthermore, an adaptive selector based on a decision tree is proposed to choose the optimal dataflow engine. Second, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Third, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the FPGA-based AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56× respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.
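The choice between aggregation-first and combination-first execution of a GCN layer A·X·W comes down to where the sparse and dense products fall. DAHBM-GCN makes this choice with a decision-tree-based selector; the sketch below uses only a simple FLOP-count heuristic to show why the choice matters, and all numbers are illustrative assumptions rather than the paper's model.

```python
def layer_flops(n_nodes: int, nnz: int, f_in: int, f_out: int):
    """Estimated multiply-accumulate counts for one GCN layer A @ X @ W.

    A: n_nodes x n_nodes sparse adjacency with nnz non-zeros
    X: n_nodes x f_in features,  W: f_in x f_out weights
    """
    dense_mm = n_nodes * f_in * f_out           # shared dense product cost
    aggregation_first = nnz * f_in + dense_mm   # (A @ X) first, then (... @ W)
    combination_first = dense_mm + nnz * f_out  # (X @ W) first, then A @ (...)
    return aggregation_first, combination_first

def pick_order(n_nodes, nnz, f_in, f_out):
    agg, comb = layer_flops(n_nodes, nnz, f_in, f_out)
    return "aggregation-first" if agg <= comb else "combination-first"

if __name__ == "__main__":
    # Cora-like layer: 2708 nodes, ~10k edges, 1433 -> 16 features.
    print(pick_order(2708, 10556, 1433, 16))    # combination-first wins when f_out << f_in
```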
{"title":"DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM","authors":"Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2025.3632073","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632073","url":null,"abstract":"Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, etc. Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multi-dataflow and high bandwidth memory (HBM), named DAHBM-GCN. Firstly, we designed a computing engine that supports multiple dataflows, aggregation-first, and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on the decision tree is proposed to select the optimal dataflow computing engine. Secondly, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Thirdly, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators FPGA-based, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56× respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"213-229"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-12 DOI: 10.1109/TPDS.2025.3632089
Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers
Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for computation-communication-separated orchestration to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.
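As a rough illustration of what "Pareto-optimal saturated configurations" means in practice, the sketch below filters candidate (latency, cost) configurations down to their Pareto front. The configuration fields are hypothetical, and HyFaaS's gray-box profiling model is not reproduced here.

```python
from typing import Dict, List

def pareto_front(configs: List[Dict]) -> List[Dict]:
    """Keep configurations not dominated on (latency, cost); lower is better."""
    front = []
    for c in configs:
        dominated = any(
            o["latency"] <= c["latency"] and o["cost"] <= c["cost"]
            and (o["latency"] < c["latency"] or o["cost"] < c["cost"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c["latency"])

if __name__ == "__main__":
    candidates = [
        {"vcpu": 1, "mem_mb": 512,  "latency": 2.4, "cost": 1.0},
        {"vcpu": 2, "mem_mb": 1024, "latency": 1.3, "cost": 1.9},
        {"vcpu": 2, "mem_mb": 2048, "latency": 1.3, "cost": 2.6},  # dominated by the line above
        {"vcpu": 4, "mem_mb": 2048, "latency": 0.9, "cost": 3.4},
    ]
    for c in pareto_front(candidates):
        print(c)
```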
{"title":"HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity","authors":"Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers","doi":"10.1109/TPDS.2025.3632089","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632089","url":null,"abstract":"Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for <italic>computation-communication-separated orchestration</i> to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"272-286"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-11 DOI: 10.1109/TPDS.2025.3631654
Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen
The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.
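The dual-timescale structure can be pictured as two nested control loops: a fast one that places tasks and a slow one that adjusts cooling. The schematic below stubs both agents with trivial rules (D3T itself uses DQN and QMIX); the interval lengths, prices, and airflow law are illustrative assumptions only.

```python
FAST_STEP_S = 60          # task-scheduling decision period (assumed)
SLOW_STEP_S = 900         # cooling/airflow decision period (assumed)

def schedule_task(task, dc_prices, dc_queue_len):
    """Fast timescale stub: send the task to the cheapest, least-loaded data center."""
    score = {dc: dc_prices[dc] * (1 + 0.1 * dc_queue_len[dc]) for dc in dc_prices}
    return min(score, key=score.get)

def adjust_airflow(dc_inlet_temp, setpoint=27.0, gain=0.05):
    """Slow timescale stub: proportional airflow correction toward the setpoint."""
    return {dc: max(0.2, 1.0 + gain * (t - setpoint)) for dc, t in dc_inlet_temp.items()}

if __name__ == "__main__":
    prices = {"dc_east": 0.12, "dc_west": 0.09, "dc_eu": 0.15}
    queues = {"dc_east": 3, "dc_west": 7, "dc_eu": 1}
    temps  = {"dc_east": 28.5, "dc_west": 26.0, "dc_eu": 29.2}
    for t in range(0, 1800, FAST_STEP_S):
        target = schedule_task({"id": t}, prices, queues)
        queues[target] += 1
        if t % SLOW_STEP_S == 0:
            airflow = adjust_airflow(temps)
    print("last placement:", target, "| airflow factors:", airflow)
```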
{"title":"D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers","authors":"Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen","doi":"10.1109/TPDS.2025.3631654","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631654","url":null,"abstract":"The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"230-246"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
How to Evaluate Distributed Coordination Systems?–A Survey and Analysis
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-11 DOI: 10.1109/TPDS.2025.3631614
Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers and researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance, or create their own ad hoc microbenchmarks, giving up comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
{"title":"How to Evaluate Distributed Coordination Systems?–A Survey and Analysis","authors":"Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas","doi":"10.1109/TPDS.2025.3631614","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631614","url":null,"abstract":"Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"198-212"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-06 DOI: 10.1109/TPDS.2025.3629667
Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao
The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download such massive raw data for centralized training due to intermittent connectivity between satellites and GS, while the scaled-up DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution to offload major training workloads to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and heterogeneity of satellite resources remain substantial barriers. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize the collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.
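The satellite early-exit model implies a simple on-board decision: finish locally at the exit head when confidence is high or no ground-station contact is available, otherwise offload intermediate activations. The sketch below shows that control flow for inference only, with toy stand-ins for the layers; the threshold and function names are assumptions, not S-Leon's design, which targets training.

```python
CONF_THRESHOLD = 0.9   # assumed early-exit confidence threshold

def satellite_forward(sample, local_layers, exit_head):
    """Run the on-board part of the split model and the early-exit head."""
    h = sample
    for layer in local_layers:
        h = layer(h)
    probs = exit_head(h)
    return h, probs

def infer_or_offload(sample, local_layers, exit_head, gs_link_up, offload_fn):
    h, probs = satellite_forward(sample, local_layers, exit_head)
    confidence = max(probs)
    if confidence >= CONF_THRESHOLD or not gs_link_up:
        return max(range(len(probs)), key=probs.__getitem__)  # take the early exit on-board
    return offload_fn(h)   # ship activations to the ground station for the rest of the model

if __name__ == "__main__":
    # Toy stand-ins: the "layers" only rescale the input; the exit head is a fixed rule.
    layers = [lambda x: [v * 0.5 for v in x], lambda x: [v + 1.0 for v in x]]
    head = lambda h: [0.7, 0.3] if h[0] > 1.0 else [0.4, 0.6]
    print(infer_or_offload([2.0, 0.1], layers, head,
                           gs_link_up=False, offload_fn=lambda h: -1))   # -> 0 (exited on-board)
```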
{"title":"S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks","authors":"Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao","doi":"10.1109/TPDS.2025.3629667","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3629667","url":null,"abstract":"The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download such massive raw data for centralized training due to intermittent connectivity between satellites and GS, while the scaled-up DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution to offload major training workloads to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and heterogeneity of satellite resources remain substantial barriers. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize the collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"106-121"},"PeriodicalIF":6.0,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-03 DOI: 10.1109/TPDS.2025.3628547
Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li
With the rapid growth of data-intensive applications, such as artificial intelligence and the Internet of Things, CDNs, which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics—hit ratio and processing latency—are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for reducing processing latency. Existing cache designs struggle to effectively balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based storage and hybrid SSD–HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. Then, FOSS incorporates a learning-based method to dynamically configure the multi-level structure configuration, making the system adaptive to various workload characteristics and caching algorithm requirements. Therefore, FOSS ensures better performance across different scenarios. Our extensive experiments show FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5% and demonstrates a consistent performance improvement in various settings on large-scale commercial CDN traces.
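To make the multi-level FIFO idea concrete, the toy cache below admits objects at the lowest level and promotes them on re-access, so each level is filled in FIFO (sequential-write-friendly) order while hot objects survive longer. Level counts, sizes, and the promotion rule here are assumptions, not FOSS's learned configuration.

```python
from collections import OrderedDict

class MultiLevelFIFO:
    def __init__(self, level_sizes=(4, 4, 8)):
        self.levels = [OrderedDict() for _ in level_sizes]   # key -> re-accessed flag
        self.sizes = level_sizes

    def get(self, key) -> bool:
        for lvl in self.levels:
            if key in lvl:
                lvl[key] = True            # remember the re-access; no reordering (pure FIFO)
                return True
        self.admit(key, level=0)
        return False

    def admit(self, key, level):
        if level >= len(self.levels):      # fell off the last level: evicted for good
            return
        lvl = self.levels[level]
        lvl[key] = False
        while len(lvl) > self.sizes[level]:
            victim, reused = lvl.popitem(last=False)         # evict in FIFO order
            if reused:
                self.admit(victim, level + 1)                # promote hot objects upward

if __name__ == "__main__":
    cache = MultiLevelFIFO()
    trace = ["a", "b", "a", "c", "d", "e", "f", "a", "g", "b"]
    hits = sum(cache.get(k) for k in trace)
    print(f"hits: {hits}/{len(trace)}")
```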
{"title":"FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching","authors":"Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li","doi":"10.1109/TPDS.2025.3628547","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3628547","url":null,"abstract":"With the rapid growth of data-intensive applications, such as artificial intelligence and the Internet of Things, CDNs, which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics—hit ratio and processing latency—are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for reducing processing latency. Existing cache designs struggle to effectively balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based storage and hybrid SSD–HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. Then, FOSS incorporates a learning-based method to dynamically configure the multi-level structure configuration, making the system adaptive to various workload characteristics and caching algorithm requirements. Therefore, FOSS ensures better performance across different scenarios. Our extensive experiments show FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5% and demonstrates a consistent performance improvement in various settings on large-scale commercial CDN traces.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"155-168"},"PeriodicalIF":6.0,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Puffer: A Serverless Platform Based on Vertical Memory Scaling
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-11-03 DOI: 10.1109/TPDS.2025.3628202
Hao Fan;Kun Wang;Zhuo Huang;Xinmin Zhang;Haibo Mi;Song Wu;Chen Yu
This paper quantitatively analyses the potential of vertical scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it scales the memory of a MicroVM up or down in blocks bound to a function instance, rather than in general-purpose pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by lazy population. Compared with existing memory scaling mechanisms, Faascale improves the memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we realize a serverless platform named Puffer. Experiments conducted on eight serverless benchmark functions demonstrate that compared with horizontal scaling strategies, Puffer reduces time for cold-starting MicroVMs by 89.01%, improves memory utilization by 17.66%, and decreases function execution time by 23.93% on average.
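Block-granular, pre-populated scaling trades a small up-front population cost for removing page faults from the critical path. The sketch below contrasts the two costs with made-up constants; the block size, fault latency, and touched fraction are illustrative assumptions rather than measurements from Puffer.

```python
BLOCK_MB = 128          # assumed scaling granularity
PAGE_KB = 4

def blocks_needed(request_mb: int) -> int:
    return -(-request_mb // BLOCK_MB)          # ceiling division

def prepopulate_cost_ms(request_mb, per_block_ms=1.5):
    """One bulk populate per block when the block is bound to the instance."""
    return blocks_needed(request_mb) * per_block_ms

def lazy_population_cost_ms(request_mb, touched_fraction=0.8, per_fault_us=3.0):
    """Page faults paid on the critical path as the function touches its memory."""
    pages = request_mb * 1024 // PAGE_KB
    return pages * touched_fraction * per_fault_us / 1000.0

if __name__ == "__main__":
    for mb in (256, 1024):
        print(f"{mb} MB: {blocks_needed(mb)} blocks, "
              f"prepopulate ~{prepopulate_cost_ms(mb):.1f} ms vs "
              f"lazy ~{lazy_population_cost_ms(mb):.1f} ms")
```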
{"title":"Puffer: A Serverless Platform Based on Vertical Memory Scaling","authors":"Hao Fan;Kun Wang;Zhuo Huang;Xinmin Zhang;Haibo Mi;Song Wu;Chen Yu","doi":"10.1109/TPDS.2025.3628202","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3628202","url":null,"abstract":"This paper quantitatively analyses the potential of vertical scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it sizes up/down the memory for a MicroVM by blocks that bind with a function instance instead of general pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by the lazy-population. Compared with existing memory scaling mechanisms, Faascale improves the memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we realize a serverless platform, named Puffer. Experiments conducted on eight serverless benchmark functions demonstrate that compared with horizontal scaling strategies, Puffer reduces time for cold-starting MicroVMs by 89.01%, improves memory utilization by 17.66%, and decreases functions execution time by 23.93% on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"184-197"},"PeriodicalIF":6.0,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11223881","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-10-31 DOI: 10.1109/TPDS.2025.3626974
Jiazhi Jiang;Yao Chen;Zining Zhang;Bingsheng He;Pingyi Luo;Mian Lu;Yuqiang Chen;Hongbing Zhang;Jiangsu Du;Dan Huang;Yutong Lu
The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPU. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU device utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the runtime pattern of LLM inference, current LLM inference memory management solutions face issues such as a one-size-fits-all spillover handling approach across different platforms, under-utilization of the GPU in the prefill stage, and suboptimal sequence selection due to the direct use of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPU by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that quantitatively analyzes the system cost of spillover handling techniques, a KV Cache swap orchestrator that refines the basic swap technique into sophisticated KV Cache disaggregation across heterogeneous devices for decoding iterations, a multi-executor scheduler to effectively coordinate task executors across devices, and a response length predictor to enable a length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing the inference latency of spillover sequences.
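A spillover cost model of the kind described above has to weigh moving a sequence's KV Cache over PCIe against recomputing its prefill later. The sketch below makes that comparison for a hypothetical 7B-class model; every constant (layer count, PCIe bandwidth, prefill throughput, the superlinear attention factor) is an assumption, not a value from FuseSpill.

```python
def kv_cache_bytes(tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    # K and V tensors per layer (assumed 7B-class model geometry, fp16).
    return 2 * tokens * layers * heads * head_dim * dtype_bytes

def swap_cost_ms(tokens, pcie_gbps=24.0):
    bytes_moved = 2 * kv_cache_bytes(tokens)        # evict now + fetch back later
    return bytes_moved / (pcie_gbps * 1e9) * 1e3

def recompute_cost_ms(tokens, prefill_tok_per_s=50_000.0, attn_knee=4096):
    # Attention makes prefill superlinear in sequence length (rough model).
    return tokens * (1 + tokens / attn_knee) / prefill_tok_per_s * 1e3

def handle_spillover(tokens):
    s, r = swap_cost_ms(tokens), recompute_cost_ms(tokens)
    return ("swap" if s <= r else "recompute", round(s, 1), round(r, 1))

if __name__ == "__main__":
    for t in (128, 2048, 16384):
        print(t, handle_spillover(t))   # short sequences favour recompute under these constants
```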
模型参数的快速增长对在GPU上部署大型生成模型提出了重大挑战。现有的LLM运行时内存管理解决方案倾向于最大化批处理大小以饱和GPU设备利用率。然而,这种做法会导致某些序列的KV缓存在模型推理期间无法容纳在内存容量有限的GPU上,需要暂时从GPU内存中移除(称为KV缓存溢出)。然而,由于没有仔细考虑LLM推理的运行时模式,目前的LLM推理内存管理解决方案面临着诸如针对不同平台的一贯性溢出处理方法,预填充阶段GPU利用率不足以及由于直接使用交换或重计算而导致的次优序列选择等问题。在本文中,我们介绍了FuseSpill,一个完整的KV缓存管理解决方案,旨在通过有效处理KV缓存溢出来提高内存受限GPU上的LLM推理。具体来说,FuseSpill包括一个溢出成本模型,定量分析溢出处理技术的系统成本,一个KV缓存交换编排器,进一步完善基本交换技术,在异构设备上对KV缓存进行复杂的解码迭代,一个多执行器调度程序,有效地协调跨设备的任务执行器,以及响应长度预测器,用于在KV缓存溢出发生时利用长度感知序列选择策略。实验结果表明,我们的实现优于现有的解决方案,在减少溢出序列推理延迟的同时,吞吐量提高了20%到40%。
{"title":"Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference","authors":"Jiazhi Jiang;Yao Chen;Zining Zhang;Bingsheng He;Pingyi Luo;Mian Lu;Yuqiang Chen;Hongbing Zhang;Jiangsu Du;Dan Huang;Yutong Lu","doi":"10.1109/TPDS.2025.3626974","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3626974","url":null,"abstract":"The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPU. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU device utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during the model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the LLM inference’s runtime pattern, current LLM inference memory management solutions face issues like one-size-fits-all spillover handling approach for different platforms, under-utilization of GPU in prefill stage, and suboptimal sequence selection due to direct employment of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPU by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that analyzes the system cost of spillover handling techniques quantitatively, a KV cache swap orchestrator to further refine the basic swap technique to sophisticated disaggregate KV Cache across heterogeneous devices for decoding iterations, a multi-executor scheduler to effectively coordinate task executors across devices, and a response length predictor to exploit the length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing inference latency of spillover sequences.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"90-105"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
EDTC: Exact Triangle Counting for Dynamic Graphs on GPU
IF 6 CAS Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2025-10-31 DOI: 10.1109/TPDS.2025.3627974
Zhuo Wang;Jiahao Tang;Zhixiong Li;Jinxing Tu;Wei Xue;Jianqiang Huang
In the process of updating a dynamic graph, an update to one edge may result in the addition or deletion of multiple triangles, while an update to multiple edges may only result in the addition or deletion of a single triangle. Consequently, accurately counting triangles on a dynamic graph is a challenging undertaking. As dynamic graphs are continuously updated, the GPU’s memory may be insufficient to accommodate the storage of larger graphs. This presents a challenge when the graph, which is constantly growing, cannot be stored. Hash-based and binary-search-based triangle counting algorithms are regarded as the most efficient for static graphs. However, when vertices with high degrees are encountered, the hash-based triangle counting method results in significant memory wastage due to the traditional construction of a hash table, leading to memory shortages; this issue remains unresolved. In this article, a triangle counting system, EDTC, is developed for dynamic graphs while ensuring the accuracy of counting. The system addresses three main problems: 1) An efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph. 2) The concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure to facilitate its implementation. This structure loads only the subgraph portion affected by the updated edge into the GPU, allowing calculations to be performed on this specific subgraph. 3) A compressed hash table is designed to reduce memory consumption, along with a dynamic shared memory assignment (DSA) strategy to fully utilize the shared memory of the GPU.
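For reference, the sketch below is a compact hash-based exact triangle counter for a static graph, the baseline technique the abstract starts from; EDTC's GPU-side structures (UA-CSR, the compressed hash table, and DSA) are not reproduced here.

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # Orient each edge toward the endpoint with larger (degree, id), so every
    # triangle is counted exactly once at its lowest-ranked vertex.
    rank = lambda x: (len(adj[x]), x)
    out = {u: {v for v in nbrs if rank(v) > rank(u)} for u, nbrs in adj.items()}
    triangles = 0
    for u, nbrs in out.items():
        for v in nbrs:
            triangles += len(nbrs & out[v])    # hash-set intersection
    return triangles

if __name__ == "__main__":
    # Two triangles sharing edge (1, 2): {1, 2, 3} and {1, 2, 4}.
    print(count_triangles([(1, 2), (2, 3), (1, 3), (2, 4), (1, 4)]))  # -> 2
```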
{"title":"EDTC: Exact Triangle Counting for Dynamic Graphs on GPU","authors":"Zhuo Wang;Jiahao Tang;Zhixiong Li;Jinxing Tu;Wei Xue;Jianqiang Huang","doi":"10.1109/TPDS.2025.3627974","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3627974","url":null,"abstract":"In the process of updating a dynamic graph, an update to one edge may result in the addition or deletion of multiple triangles, while an update to multiple edges may only result in the addition or deletion of a single triangle. Consequently, accurately counting triangles on a dynamic graph is a challenging undertaking. As dynamic graphs are continuously updated, the GPU’s memory may be insufficient to accommodate the storage of larger graphs. This presents a challenge when the graph, which is constantly growing, cannot be stored. The hash-based and binary search-based triangle counting algorithm is regarded as the most efficient for static graphs. However, when vertices with high degrees are encountered, the hash-based triangle counting method results in significant memory wastage due to the traditional construction of a hash table, leading to a shortage of memory. This issue remains unresolved. In this article a triangle counting system EDTC is developed for dynamic graphs while ensuring the accuracy of counting. The system addresses three main problems: 1) An efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph. 2) The concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure to facilitate its implementation. This structure loads only the subgraph portion affected by the updated edge into the GPU, allowing calculations to be performed on this specific subgraph. 3) A compressed hash table is designed to reduce memory consumption, along with a dynamic shared memory assignment(DSA) strategy to fully utilize the shared memory of the GPU.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"247-259"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0