Fully Decentralized Data Distribution for Large-Scale HPC Systems
Pub Date : 2025-11-17 DOI: 10.1109/TPDS.2025.3633298
Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu
For many years in the HPC data distribution scenario, as the scale of HPC systems has continued to increase, manufacturers have had to add more data providers to raise I/O parallelism and keep up with the data demanders. In large-scale, and especially exascale, HPC systems, this mode of decoupling demanders from providers presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which every demander also acts as a provider can fundamentally cope with changes in scale and offer the best scalability; we call this the all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on the computing networks of HPC systems and propose FD3, a fully decentralized data distribution method. Based on the features of the HPC networking environment, we design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and we relieve the tracker load with a neighborhood local-generation algorithm. Experimental results show that FD3 scales smoothly to 11k+ computing nodes and performs far better than the parallel file system. Compared with the original BitTorrent, performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, this work can stimulate further exploration of distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for exascale HPC systems.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 304-321.
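To make the piece-selection idea concrete, here is a minimal sketch of an HLF-style policy. The exact semantics of FD3's policy are only inferred from its name, so the ranking function, the tie-breaking order, and the data structures below are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an HLF-style ("Highest ranking, Longest consecutive
# segment First") piece-selection policy; FD3's real semantics may differ.

def hlf_select(needed, held, rank):
    """Pick the next piece to request.

    needed: set of piece indices this node still lacks
    held:   set of piece indices already held locally
    rank:   dict mapping piece index -> ranking score (higher = preferred)
    """
    def consecutive_run(piece):
        # Length of the consecutive run of held pieces that `piece` would extend.
        length, i = 1, piece - 1
        while i in held:
            length += 1
            i -= 1
        i = piece + 1
        while i in held:
            length += 1
            i += 1
        return length

    # Prefer the highest-ranked piece; break ties by the longest consecutive
    # segment it would create, keeping accesses as sequential as possible.
    return max(needed, key=lambda p: (rank.get(p, 0), consecutive_run(p)))

# Toy usage: pieces 3 and 7 are missing; 7 has a higher ranking.
print(hlf_select({3, 7}, {0, 1, 2, 4, 5, 6}, {3: 1, 7: 2}))  # -> 7
```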
DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM
Pub Date : 2025-11-12 DOI: 10.1109/TPDS.2025.3632073
Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li
Graph-structured data are widely used in transportation, molecular, e-commerce, and other networks. The Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependence of GCN dataflow patterns on the graph data, make accelerating GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multiple dataflows and high bandwidth memory (HBM), named DAHBM-GCN. First, we design a computing engine that supports multiple dataflows, covering both aggregation-first and combination-first orders, together with an adaptive, decision-tree-based selector that chooses the optimal dataflow engine. Second, an efficient mapping of pseudo channels (PCs) onto multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Third, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the model's computational complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluations demonstrate that, across various datasets, DAHBM-GCN achieves average speedups of 52.5–129.3× and 4.9–7.9× over PyG-GCN and DGL-GCN on CPU, respectively. Compared to the FPGA-based accelerators AWB-GCN, HyGCN, HLS-GCN, and GCNAX, DAHBM-GCN also delivers average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56×, respectively. In addition, DAHBM-GCN offers high flexibility and low energy consumption.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 213-229.
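The dependence of the best dataflow order on graph and layer shapes can be seen from a simple multiply-count argument for a GCN layer H' = A·X·W, where A is the sparse adjacency matrix and W the dense weight matrix. The sketch below is only a toy cost model with made-up sizes; it is not DAHBM-GCN's decision-tree selector, which also accounts for hardware features.

```python
# Why dataflow order matters for a GCN layer H' = A @ X @ W.
# "Aggregation-first" computes (A @ X) @ W; "combination-first" computes A @ (X @ W).
# This toy multiply count is illustrative only.

def layer_cost(num_nodes, num_edges, in_dim, out_dim):
    # Sparse A (num_edges nonzeros), dense X (num_nodes x in_dim), dense W (in_dim x out_dim).
    agg_first = num_edges * in_dim + num_nodes * in_dim * out_dim    # (A@X) then @W
    comb_first = num_nodes * in_dim * out_dim + num_edges * out_dim  # (X@W) then A@
    return agg_first, comb_first

# When the layer shrinks the feature dimension (in_dim > out_dim), combining first
# makes the sparse aggregation cheaper; otherwise aggregating first can win.
a, c = layer_cost(num_nodes=10_000, num_edges=200_000, in_dim=512, out_dim=16)
print("aggregation-first:", a, "combination-first:", c)
```

With these made-up sizes, combination-first needs roughly half the multiplications; this data-dependent gap is exactly what an adaptive dataflow selector exploits.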
HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity
Pub Date : 2025-11-12 DOI: 10.1109/TPDS.2025.3632089
Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers
Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for computation-communication-separated orchestration to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 272-286.
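One ingredient that a small example can clarify is the selection of Pareto-optimal configurations from profiled (latency, cost) points. The filter below is generic and the candidate configurations are made up; HyFaaS's gray-box profiling model and its notion of "saturated" configurations are not reproduced here.

```python
# Illustrative Pareto-frontier filter over profiled stage configurations.
# Each candidate is (config_name, latency_seconds, cost_dollars); lower is better for both.

def pareto_optimal(candidates):
    frontier = []
    for name, lat, cost in candidates:
        dominated = any(
            (l2 <= lat and c2 <= cost) and (l2 < lat or c2 < cost)
            for _, l2, c2 in candidates
        )
        if not dominated:
            frontier.append((name, lat, cost))
    return frontier

configs = [
    ("1vCPU/128MB", 2.4, 0.010),
    ("2vCPU/256MB", 1.3, 0.012),
    ("4vCPU/512MB", 1.2, 0.025),  # barely faster but much costlier: still on the frontier
    ("2vCPU/512MB", 1.3, 0.020),  # dominated by 2vCPU/256MB
]
print(pareto_optimal(configs))
```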
D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers
Pub Date : 2025-11-11 DOI: 10.1109/TPDS.2025.3631654
Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen
The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 230-246.
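For reference, power usage effectiveness (PUE) is defined as total facility energy divided by IT equipment energy, so values closer to 1 mean less cooling and power-delivery overhead. The numbers in the sketch below are purely hypothetical and only show what a 7% PUE improvement means arithmetically.

```python
# Power usage effectiveness: PUE = total facility energy / IT equipment energy.
# The figures below are made up purely to illustrate the arithmetic.

it_energy = 1000.0            # kWh consumed by servers (IT subsystem)
cooling_and_overhead = 500.0  # kWh for cooling, power delivery, etc.

pue_before = (it_energy + cooling_and_overhead) / it_energy  # 1.50
pue_after = pue_before * (1 - 0.07)                          # ~1.395 after a 7% improvement

overhead_after = (pue_after - 1.0) * it_energy               # non-IT energy at the same IT load
print(f"PUE {pue_before:.2f} -> {pue_after:.3f}, overhead {cooling_and_overhead:.0f} -> {overhead_after:.0f} kWh")
```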
How to Evaluate Distributed Coordination Systems?–A Survey and Analysis
Pub Date : 2025-11-11 DOI: 10.1109/TPDS.2025.3631614
Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 198-212.
S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks
Pub Date : 2025-11-06 DOI: 10.1109/TPDS.2025.3629667
Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao
The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download such massive raw data for centralized training due to intermittent connectivity between satellites and GS, while the scaled-up DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution to offload major training workloads to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and the heterogeneity of satellite resources remain major obstacles. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize the collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 106-121.
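As an illustration of the split-plus-early-exit idea, the sketch below partitions a toy model between a satellite and a ground station: during non-contact periods the satellite falls back to a lightweight early-exit head, and during contact it ships cut-layer activations (never raw data) to the ground side. All layer sizes, the cut point, and the heads are arbitrary assumptions; S-Leon's actual architecture, customization, and training procedure are not reproduced.

```python
# Minimal numpy sketch of a split model with a satellite-side early exit.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0)

# Satellite side: layers up to the cut point, plus a lightweight early-exit head.
W_sat = rng.standard_normal((32, 16))
W_exit = rng.standard_normal((16, 10))

# Ground-station side: the remaining (heavier) layers.
W_gs1 = rng.standard_normal((16, 64))
W_gs2 = rng.standard_normal((64, 10))

def ground_forward(h):
    return relu(h @ W_gs1) @ W_gs2

def satellite_forward(x, in_contact):
    h = relu(x @ W_sat)          # computed on the satellite
    if not in_contact:
        return h @ W_exit        # early-exit prediction during non-contact periods
    return ground_forward(h)     # otherwise ship activations (not raw data) to the GS

x = rng.standard_normal((4, 32))                      # a small batch of local samples
print(satellite_forward(x, in_contact=False).shape)   # (4, 10) from the early exit
print(satellite_forward(x, in_contact=True).shape)    # (4, 10) from the full split model
```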
FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching
Pub Date : 2025-11-03 DOI: 10.1109/TPDS.2025.3628547
Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li
With the rapid growth of data-intensive applications such as artificial intelligence and the Internet of Things, content delivery networks (CDNs), which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics, hit ratio and processing latency, are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for lower processing latency. Existing cache designs struggle to balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based and hybrid SSD-HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. FOSS further incorporates a learning-based method to dynamically configure the multi-level structure, making the system adaptive to various workload characteristics and caching algorithm requirements and thus ensuring better performance across different scenarios. Our extensive experiments show that FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5%, and demonstrates consistent performance improvements in various settings on large-scale commercial CDN traces.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 155-168.
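To ground the multi-level FIFO idea, here is a toy cache in which objects are admitted into the lowest queue, promoted one level on a hit, and demoted or evicted in FIFO order, so each level is written sequentially while hot objects drift upward. The number and sizes of levels are arbitrary, and FOSS's learning-based configuration of this structure is not modeled.

```python
# Toy multi-level FIFO cache: promote on hit, demote/evict in FIFO order per level.
from collections import OrderedDict

class MultiLevelFIFO:
    def __init__(self, level_capacities=(2, 2, 4)):
        # levels[0] is the highest level; evictions only happen from the last level.
        self.levels = [OrderedDict() for _ in level_capacities]
        self.caps = level_capacities

    def get(self, key):
        for i, level in enumerate(self.levels):
            if key in level:
                value = level.pop(key)
                self._insert(max(i - 1, 0), key, value)   # promote on hit
                return value
        return None

    def put(self, key, value):
        self._insert(len(self.levels) - 1, key, value)    # admit into the lowest level

    def _insert(self, i, key, value):
        level = self.levels[i]
        level[key] = value
        if len(level) > self.caps[i]:
            old_key, old_value = level.popitem(last=False)  # FIFO within the level
            if i + 1 < len(self.levels):
                self._insert(i + 1, old_key, old_value)     # demote downward
            # else: evicted from the cache entirely

cache = MultiLevelFIFO()
for k in "abcd":
    cache.put(k, k.upper())
cache.get("a")         # hit: 'a' is promoted to a higher level
cache.put("e", "E")    # new insertions now age out cold objects, not the promoted 'a'
print(cache.get("a"))  # -> 'A'
```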
Puffer: A Serverless Platform Based on Vertical Memory Scaling
Pub Date : 2025-11-03 DOI: 10.1109/TPDS.2025.3628202
Hao Fan;Kun Wang;Zhuo Huang;Xinmin Zhang;Haibo Mi;Song Wu;Chen Yu
This paper quantitatively analyses the potential of vertically scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that prevents vertical scaling from reaching its performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it sizes the memory of a MicroVM up or down in blocks that are bound to a function instance rather than in generic pages; and 2) it pre-populates physical memory for function instances to avoid the delays introduced by lazy population. Compared with existing memory scaling mechanisms, Faascale improves memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we build a serverless platform named Puffer. Experiments on eight serverless benchmark functions demonstrate that, compared with horizontal scaling strategies, Puffer reduces the time for cold-starting MicroVMs by 89.01%, improves memory utilization by 17.66%, and decreases function execution time by 23.93% on average.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 184-197.
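The block-versus-page granularity argument can be made concrete with a back-of-the-envelope count of plug/unplug operations. The 128 MiB block size below is an assumption chosen only for illustration; Faascale's actual block management and pre-population logic are not reproduced.

```python
# Toy illustration of block-granular vs page-granular memory scaling for a MicroVM.

PAGE_SIZE = 4 * 1024            # 4 KiB pages
BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MiB scaling block

def scale_ops(current_bytes, target_bytes, granularity):
    """Number of plug/unplug operations needed to move between two memory sizes."""
    delta = abs(target_bytes - current_bytes)
    return -(-delta // granularity)  # ceiling division

current = 256 * 1024 * 1024      # MicroVM currently holds 256 MiB
target = 1 * 1024 * 1024 * 1024  # the function instance now needs 1 GiB

print("page-granular ops: ", scale_ops(current, target, PAGE_SIZE))    # 196608
print("block-granular ops:", scale_ops(current, target, BLOCK_SIZE))   # 6
```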
Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference
Pub Date : 2025-10-31 DOI: 10.1109/TPDS.2025.3626974
Jiazhi Jiang;Yao Chen;Zining Zhang;Bingsheng He;Pingyi Luo;Mian Lu;Yuqiang Chen;Hongbing Zhang;Jiangsu Du;Dan Huang;Yutong Lu
The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPUs. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the runtime pattern of LLM inference, current memory management solutions suffer from a one-size-fits-all spillover-handling approach across different platforms, under-utilization of the GPU in the prefill stage, and suboptimal sequence selection due to the direct use of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPUs by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that quantitatively analyzes the system cost of spillover-handling techniques, a KV Cache swap orchestrator that refines the basic swap technique to disaggregate the KV Cache across heterogeneous devices for decoding iterations, a multi-executor scheduler that effectively coordinates task executors across devices, and a response length predictor that enables a length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing the inference latency of spillover sequences.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 90-105.
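A spillover cost model of the kind described can be illustrated by comparing swapping a sequence's KV Cache back over PCIe against recomputing it with a prefill pass. Every constant below (KV layout, PCIe bandwidth, prefill throughput) is an illustrative assumption rather than a measurement, and FuseSpill's model is considerably more detailed.

```python
# Back-of-the-envelope comparison: swap a spilled KV Cache back vs recompute it.

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # K and V tensors per token per layer (fp16 elements assumed).
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_elem

def swap_in_seconds(kv_bytes, pcie_gbps=25.0):
    return kv_bytes / (pcie_gbps * 1e9)

def recompute_seconds(seq_len, prefill_tokens_per_s=8000.0):
    return seq_len / prefill_tokens_per_s

seq_len = 2048
kv = kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128)
swap, recompute = swap_in_seconds(kv), recompute_seconds(seq_len)
print(f"KV Cache {kv/1e6:.0f} MB: swap {swap*1e3:.1f} ms vs recompute {recompute*1e3:.1f} ms")
print("prefer", "swap" if swap < recompute else "recompute")
```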
EDTC: Exact Triangle Counting for Dynamic Graphs on GPU
Pub Date : 2025-10-31 DOI: 10.1109/TPDS.2025.3627974
Zhuo Wang;Jiahao Tang;Zhixiong Li;Jinxing Tu;Wei Xue;Jianqiang Huang
When a dynamic graph is updated, an update to a single edge may add or delete multiple triangles, while updates to multiple edges may add or delete only a single triangle. Consequently, accurately counting triangles on a dynamic graph is a challenging undertaking. Moreover, as a dynamic graph is continuously updated, GPU memory may become insufficient to store the ever-growing graph. Hash-based and binary-search-based triangle counting algorithms are regarded as the most efficient for static graphs. However, for vertices with high degrees, the traditional construction of hash tables in hash-based triangle counting wastes significant memory and can lead to memory shortages; this issue remains unresolved. In this article, a triangle counting system named EDTC is developed for dynamic graphs while ensuring exact counts. The system addresses three main problems: 1) an efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph; 2) the concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure that implements it, loading only the subgraph affected by the updated edges onto the GPU so that calculations are performed on this specific subgraph; and 3) a compressed hash table is designed to reduce memory consumption, together with a dynamic shared memory assignment (DSA) strategy to fully utilize the shared memory of the GPU.
IEEE Transactions on Parallel and Distributed Systems, vol. 37, no. 1, pp. 247-259.
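The core counting step behind exact maintenance under edge updates is a common-neighbor intersection: inserting or deleting edge (u, v) changes the triangle count by the number of common neighbors of u and v. The sketch below shows that invariant on a tiny graph; EDTC's GPU-side structures (UA-CSR, compressed hash tables, DSA) are not modeled.

```python
# Exact triangle maintenance under edge updates via common-neighbor intersection.

def apply_update(adj, u, v, insert=True):
    """Return the change in the global triangle count for one edge update."""
    adj.setdefault(u, set())
    adj.setdefault(v, set())
    if insert:
        delta = len(adj[u] & adj[v])   # each common neighbor closes one new triangle
        adj[u].add(v); adj[v].add(u)
    else:
        adj[u].discard(v); adj[v].discard(u)
        delta = -len(adj[u] & adj[v])  # each common neighbor loses one triangle
    return delta

adj, triangles = {}, 0
for edge in [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]:
    triangles += apply_update(adj, *edge)
print(triangles)                                    # 2 triangles: (0,1,2) and (0,2,3)
triangles += apply_update(adj, 0, 2, insert=False)
print(triangles)                                    # 0: removing edge (0,2) breaks both
```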