Pub Date: 2025-11-06 | DOI: 10.1109/TPDS.2025.3629667
S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks
Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao
The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download the massive volumes of raw data needed for centralized training due to intermittent connectivity between satellites and GS, while ever-larger DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution that offloads the bulk of the training workload to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and the heterogeneity of satellite resources remain significant obstacles. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.
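As a rough illustration of the early-exit and online-distillation ideas above, the following sketch (PyTorch) shows a client-side model with a local exit head and a loss that adds a distillation term whenever ground-side logits from the last contact window are available. The layer sizes, the split point, and the loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch only: an early-exit client model with an optional
# knowledge-distillation term. Layer sizes, split point, and weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyExitClient(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.bottom = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.exit_head = nn.Linear(256, num_classes)   # local early exit

    def forward(self, x):
        feat = self.bottom(x)          # activations that would be sent to the GS
        return feat, self.exit_head(feat)

def local_loss(exit_logits, labels, ground_logits=None, T=2.0, alpha=0.5):
    """Cross-entropy on the early exit; add a KD term when ground-side
    (server) logits from the last contact window are available."""
    ce = F.cross_entropy(exit_logits, labels)
    if ground_logits is None:
        return ce
    kd = F.kl_div(F.log_softmax(exit_logits / T, dim=1),
                  F.softmax(ground_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1 - alpha) * kd

if __name__ == "__main__":
    model = EarlyExitClient()
    x, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
    feat, logits = model(x)
    print(local_loss(logits, y).item())                                    # non-contact period
    print(local_loss(logits, y, ground_logits=torch.randn(8, 10)).item())  # with ground knowledge
```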
{"title":"S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks","authors":"Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao","doi":"10.1109/TPDS.2025.3629667","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3629667","url":null,"abstract":"The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download such massive raw data for centralized training due to intermittent connectivity between satellites and GS, while the scaled-up DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution to offload major training workloads to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and heterogeneity of satellite resources remain substantial barriers. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize the collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"106-121"},"PeriodicalIF":6.0,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-03 | DOI: 10.1109/TPDS.2025.3628547
FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching
Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li
With the rapid growth of data-intensive applications, such as artificial intelligence and the Internet of Things, CDNs, which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics—hit ratio and processing latency—are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for lower processing latency. Existing cache designs struggle to balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based and hybrid SSD–HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. FOSS further incorporates a learning-based method to dynamically configure the multi-level structure, making the system adaptive to various workload characteristics and caching algorithm requirements, and thus ensuring better performance across different scenarios. Our extensive experiments on large-scale commercial CDN traces show that FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5%, and delivers consistent performance improvements across various settings.
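To make the multi-level FIFO idea concrete, here is a minimal in-memory sketch in which every level is an append-only (sequentially written) queue and a hit promotes an object by reinserting it at the tail of the next level. The level count, sizes, and promotion rule are assumptions for illustration; FOSS's on-SSD layout and learned configuration are not modeled.

```python
# Illustrative sketch only: a multi-level FIFO cache where each level is
# appended sequentially and hits promote objects on reinsertion.
from collections import OrderedDict

class MultiLevelFIFO:
    def __init__(self, level_sizes=(4, 4, 8)):
        self.levels = [OrderedDict() for _ in level_sizes]  # insertion order = FIFO
        self.sizes = level_sizes

    def get(self, key):
        for i, level in enumerate(self.levels):
            if key in level:
                value = level.pop(key)
                # promote by appending to the tail of the next (or top) level,
                # which keeps every level's write pattern sequential
                self._insert(min(i + 1, len(self.levels) - 1), key, value)
                return value
        return None

    def put(self, key, value):
        self._insert(0, key, value)          # new objects enter level 0

    def _insert(self, i, key, value):
        level = self.levels[i]
        level[key] = value
        while len(level) > self.sizes[i]:
            level.popitem(last=False)        # FIFO eviction from the head

if __name__ == "__main__":
    cache = MultiLevelFIFO()
    for k in "abcabc":
        if cache.get(k) is None:
            cache.put(k, k.upper())
    print([list(lvl.keys()) for lvl in cache.levels])  # hit items promoted to level 1
```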
{"title":"FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching","authors":"Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li","doi":"10.1109/TPDS.2025.3628547","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3628547","url":null,"abstract":"With the rapid growth of data-intensive applications, such as artificial intelligence and the Internet of Things, CDNs, which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics—hit ratio and processing latency—are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for reducing processing latency. Existing cache designs struggle to effectively balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based storage and hybrid SSD–HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. Then, FOSS incorporates a learning-based method to dynamically configure the multi-level structure configuration, making the system adaptive to various workload characteristics and caching algorithm requirements. Therefore, FOSS ensures better performance across different scenarios. Our extensive experiments show FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5% and demonstrates a consistent performance improvement in various settings on large-scale commercial CDN traces.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"155-168"},"PeriodicalIF":6.0,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper quantitatively analyses the potential of vertically scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it scales a MicroVM's memory up or down in blocks bound to a function instance rather than in general pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by lazy population. Compared with existing memory scaling mechanisms, Faascale improves memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we realize a serverless platform named Puffer. Experiments on eight serverless benchmark functions demonstrate that, compared with horizontal scaling strategies, Puffer reduces MicroVM cold-start time by 89.01%, improves memory utilization by 17.66%, and decreases function execution time by 23.93% on average.
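The block-based scaling idea can be sketched as follows: memory is granted and released in whole blocks bound to a function instance, drawn from a pool of pre-populated blocks so scale-ups avoid lazy population. Block size, pool size, and the class names are assumptions, not Faascale's implementation.

```python
# Illustrative sketch only: block-granular vertical memory scaling for a MicroVM,
# backed by a pre-populated block pool.
BLOCK_MB = 128

class BlockPool:
    def __init__(self, prepopulated_blocks=16):
        self.free = prepopulated_blocks          # blocks already backed by physical memory

    def take(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return granted

    def give(self, n):
        self.free += n

class MicroVM:
    def __init__(self, pool, base_blocks=2):
        self.pool, self.blocks = pool, base_blocks

    def scale_to(self, demand_mb):
        """Size memory up/down in whole blocks bound to the function instance."""
        want = -(-demand_mb // BLOCK_MB)          # ceiling division
        if want > self.blocks:
            self.blocks += self.pool.take(want - self.blocks)
        elif want < self.blocks:
            self.pool.give(self.blocks - want)
            self.blocks = want
        return self.blocks * BLOCK_MB

if __name__ == "__main__":
    pool = BlockPool()
    vm = MicroVM(pool)
    print(vm.scale_to(900))   # scale up for a memory-hungry invocation -> 1024 MB
    print(vm.scale_to(200))   # release blocks back to the pre-populated pool -> 256 MB
```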
{"title":"Puffer: A Serverless Platform Based on Vertical Memory Scaling","authors":"Hao Fan;Kun Wang;Zhuo Huang;Xinmin Zhang;Haibo Mi;Song Wu;Chen Yu","doi":"10.1109/TPDS.2025.3628202","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3628202","url":null,"abstract":"This paper quantitatively analyses the potential of vertical scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it sizes up/down the memory for a MicroVM by blocks that bind with a function instance instead of general pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by the lazy-population. Compared with existing memory scaling mechanisms, Faascale improves the memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we realize a serverless platform, named Puffer. Experiments conducted on eight serverless benchmark functions demonstrate that compared with horizontal scaling strategies, Puffer reduces time for cold-starting MicroVMs by 89.01%, improves memory utilization by 17.66%, and decreases functions execution time by 23.93% on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"184-197"},"PeriodicalIF":6.0,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11223881","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPUs. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the runtime pattern of LLM inference, current memory management solutions suffer from a one-size-fits-all spillover handling approach across different platforms, under-utilization of the GPU in the prefill stage, and suboptimal sequence selection due to the direct employment of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPUs by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that quantitatively analyzes the system cost of spillover handling techniques, a KV Cache swap orchestrator that refines basic swapping into a sophisticated scheme that disaggregates the KV Cache across heterogeneous devices during decoding iterations, a multi-executor scheduler that effectively coordinates task executors across devices, and a response length predictor that enables a length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing the inference latency of spillover sequences.
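A simplified version of the spillover cost model described above is sketched below: for a sequence whose KV Cache must leave the GPU, it compares the time to swap the cache over the host link against the time to recompute it via prefill, so the better choice changes with the platform. The cost formulas and constants are assumptions, not FuseSpill's calibrated model.

```python
# Illustrative sketch only: a simplified spillover cost model choosing between
# swapping a sequence's KV Cache and recomputing it from its tokens.
def choose_spillover(kv_bytes, seq_tokens, pcie_gbps=16.0, prefill_tps=8000.0):
    swap_ms = kv_bytes / (pcie_gbps * 1e9) * 1e3       # transfer the cache over the host link
    recompute_ms = seq_tokens / prefill_tps * 1e3      # re-run prefill when the sequence returns
    return ("swap", round(swap_ms, 1)) if swap_ms <= recompute_ms else ("recompute", round(recompute_ms, 1))

if __name__ == "__main__":
    # fast host link: swapping the cache out and back is cheaper than re-running prefill
    print(choose_spillover(kv_bytes=2 * 2**30, seq_tokens=4000))
    # slow host link: the same sequence is better recomputed, which is why a
    # one-size-fits-all policy across platforms is suboptimal
    print(choose_spillover(kv_bytes=2 * 2**30, seq_tokens=4000, pcie_gbps=2.0))
```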
{"title":"Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference","authors":"Jiazhi Jiang;Yao Chen;Zining Zhang;Bingsheng He;Pingyi Luo;Mian Lu;Yuqiang Chen;Hongbing Zhang;Jiangsu Du;Dan Huang;Yutong Lu","doi":"10.1109/TPDS.2025.3626974","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3626974","url":null,"abstract":"The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPU. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU device utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during the model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the LLM inference’s runtime pattern, current LLM inference memory management solutions face issues like one-size-fits-all spillover handling approach for different platforms, under-utilization of GPU in prefill stage, and suboptimal sequence selection due to direct employment of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPU by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that analyzes the system cost of spillover handling techniques quantitatively, a KV cache swap orchestrator to further refine the basic swap technique to sophisticated disaggregate KV Cache across heterogeneous devices for decoding iterations, a multi-executor scheduler to effectively coordinate task executors across devices, and a response length predictor to exploit the length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing inference latency of spillover sequences.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"90-105"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When a dynamic graph is updated, an update to one edge may add or delete multiple triangles, while updates to multiple edges may add or delete only a single triangle. Consequently, accurately counting triangles on a dynamic graph is challenging. Moreover, as dynamic graphs are continuously updated, the GPU's memory may become insufficient to store the growing graph. Hash-based and binary-search-based triangle counting algorithms are regarded as the most efficient for static graphs. However, for vertices with high degrees, the traditional construction of a hash table in hash-based triangle counting wastes significant memory and can exhaust GPU memory; this issue has remained unresolved. In this article, a triangle counting system, EDTC, is developed for dynamic graphs while ensuring counting accuracy. The system addresses three main problems: 1) An efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph. 2) The concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure to implement it. This structure loads only the subgraph affected by an updated edge into the GPU, so that calculations are performed on this specific subgraph. 3) A compressed hash table is designed to reduce memory consumption, along with a dynamic shared memory assignment (DSA) strategy to fully utilize the GPU's shared memory.
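For reference, the set-intersection core that such hash- or binary-search-based GPU counters accelerate can be written compactly on the CPU: orient each edge toward the higher-ranked endpoint and intersect out-neighbor sets so every triangle is counted exactly once. This sketch does not model EDTC's UA-CSR, compressed hash table, or DSA strategy.

```python
# Illustrative sketch only: exact triangle counting via degree-ordered edge
# orientation and neighbor-set intersection.
from collections import defaultdict

def count_triangles(edges):
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    # orient each edge toward the endpoint with higher (degree, id) rank,
    # so every triangle is counted exactly once
    rank = lambda x: (len(adj[x]), x)
    out = {u: {v for v in nbrs if rank(v) > rank(u)} for u, nbrs in adj.items()}
    return sum(len(out[u] & out[v]) for u in out for v in out[u])

if __name__ == "__main__":
    g = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0)]   # two triangles: 0-1-2 and 0-2-3
    print(count_triangles(g))                      # -> 2
```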
{"title":"EDTC: Exact Triangle Counting for Dynamic Graphs on GPU","authors":"Zhuo Wang;Jiahao Tang;Zhixiong Li;Jinxing Tu;Wei Xue;Jianqiang Huang","doi":"10.1109/TPDS.2025.3627974","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3627974","url":null,"abstract":"In the process of updating a dynamic graph, an update to one edge may result in the addition or deletion of multiple triangles, while an update to multiple edges may only result in the addition or deletion of a single triangle. Consequently, accurately counting triangles on a dynamic graph is a challenging undertaking. As dynamic graphs are continuously updated, the GPU’s memory may be insufficient to accommodate the storage of larger graphs. This presents a challenge when the graph, which is constantly growing, cannot be stored. The hash-based and binary search-based triangle counting algorithm is regarded as the most efficient for static graphs. However, when vertices with high degrees are encountered, the hash-based triangle counting method results in significant memory wastage due to the traditional construction of a hash table, leading to a shortage of memory. This issue remains unresolved. In this article a triangle counting system EDTC is developed for dynamic graphs while ensuring the accuracy of counting. The system addresses three main problems: 1) An efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph. 2) The concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure to facilitate its implementation. This structure loads only the subgraph portion affected by the updated edge into the GPU, allowing calculations to be performed on this specific subgraph. 3) A compressed hash table is designed to reduce memory consumption, along with a dynamic shared memory assignment(DSA) strategy to fully utilize the shared memory of the GPU.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"247-259"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3627574
Joint Optimization of Resource Allocation and Request Batching for Multi-Tenant Inference Serving on GPU
Yuning Zhang;Nan Yang;Chen Pan;Dong Yuan
GPU technology has significantly aided Deep Learning (DL), especially in enhancing the performance of inference services. Tenants deploy inference models on the GPU, which are then uniformly scheduled and executed by an inference serving system. In resource-constrained environments, a single GPU needs to handle requests from multiple tenants. The diversity of inference tasks, varying request frequencies, and different model architectures make designing an efficient inference serving system a significant challenge. Most current research discusses resource allocation and request batching separately, overlooking the critical connection between them. In such complex inference environments, this connection is particularly crucial. To rapidly process requests from various tenants in such a dynamic environment, we leverage the connection between resource allocation and request batching to design DRS: Deep Reinforcement Scheduler. In DRS, we use Deep Deterministic Policy Gradient (DDPG) as the scheduling algorithm and NVIDIA Multi-Process Service (MPS) for spatial parallelism when sharing a single GPU among multiple tenants. By observing environmental information, DRS rapidly adjusts the GPU allocation for different tenants and finds the proper request batch size, thereby maintaining high efficiency. In experiments, DRS achieves speedups of up to 2.23× and 24× over the baselines under the makespan and Job Completion Time (JCT) metrics, respectively.
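One way to picture the joint action space is sketched below: a continuous vector produced by an actor network (as in DDPG) is decoded into per-tenant MPS GPU shares and request batch sizes. The decoding rules and bounds are assumptions for illustration, not DRS's actual state or action design.

```python
# Illustrative sketch only: decoding a continuous actor output into joint
# per-tenant GPU shares (MPS percentages) and request batch sizes.
import numpy as np

def decode_action(action, num_tenants, max_batch=32):
    """action: flat vector in [-1, 1] of length 2 * num_tenants."""
    a = np.asarray(action, dtype=float).reshape(num_tenants, 2)
    raw_share = (a[:, 0] + 1.0) / 2.0 + 1e-6
    gpu_share = raw_share / raw_share.sum()            # fractions of the GPU summing to 1
    batch = np.clip(((a[:, 1] + 1.0) / 2.0 * max_batch).round(), 1, max_batch)
    return gpu_share, batch.astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shares, batches = decode_action(rng.uniform(-1, 1, size=6), num_tenants=3)
    print([f"{s:.0%}" for s in shares], batches)
```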
{"title":"Joint Optimization of Resource Allocation and Request Batching for Multi-Tenant Inference Serving on GPU","authors":"Yuning Zhang;Nan Yang;Chen Pan;Dong Yuan","doi":"10.1109/TPDS.2025.3627574","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3627574","url":null,"abstract":"The GPU technology has significantly aided Deep Learning (DL), especially in enhancing the performance of inference services. Tenants deploy inference models on the GPU, which are then uniformly scheduled and executed by an inference serving system. In resource-constrained environments, a single GPU needs to handle requests from multiple tenants. The diversity of inference tasks, varying request frequencies, and different model architectures make designing an efficient inference serving system a significant challenge. Most current research discusses resource allocation and request batching separately, overlooking the critical connection between them. In such complex inference environments, this connection is particularly crucial. To rapidly process requests from various tenants in such a dynamic environment, we leverage the connection between resource allocation and request batching to design DRS: Deep Reinforcement Scheduler. In DRS, we use the Deep Deterministic Policy Gradient (DDPG) as our scheduling algorithm and NVIDIA Multi-Process Service (MPS) for spatial parallelism in sharing a single GPU among multiple tenants. By observing environmental information, we can rapidly adjust the GPU allocation for different tenants and find the proper request batch size, thereby maintaining high efficiency. In experiments, DRS achieves a speedup of up to 2.23× and 24× compared to the baselines with the Makespan and Job Completion Time (JCT) metrics.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"287-303"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3627553
RPCE: Dynamic Data Replicas Placement Management by Cloud and Edge Collaboration
Xiaofeng Lu;Luwen Zou;Yilu Mao;Pietro Lio;Pan Hui
With the rapid advancement of information technology, traditional centralized cloud computing systems face challenges in meeting the stringent low-latency demands of emerging applications. To tackle this issue, this paper proposes a delay-aware cloud-edge architecture that incorporates the distributed characteristics of edge infrastructure, enabling low-latency collaboration among edge nodes within the same geographic region. Furthermore, based on this architecture, a dynamic data replica management scheme is introduced, involving synergistic mechanisms between edge nodes and cloud centers to optimally place data replicas on the most suitable edge nodes. The scheme adopts a hierarchical strategy: edge nodes perform short-term localized management of data replicas, while the cloud executes long-term holistic oversight. Experimental results demonstrate that the dynamic approach effectively reduces user access latency, minimizes replica migration frequency, and decreases network bandwidth consumption.
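A minimal sketch of one placement decision in this spirit is shown below: given per-node request counts and an intra-region latency matrix, the replica of a data item is placed on the edge node that minimizes request-weighted access latency. The single-replica rule and the example latencies are assumptions; RPCE's split between short-term edge management and long-term cloud oversight is not modeled.

```python
# Illustrative sketch only: greedy placement of one replica on the edge node
# minimizing request-weighted access latency within a region.
def place_replica(requests, latency_ms):
    """requests: {node: request_count}; latency_ms[i][j]: latency from node i to node j."""
    nodes = list(latency_ms)
    cost = lambda host: sum(requests.get(n, 0) * latency_ms[n][host] for n in nodes)
    return min(nodes, key=cost)

if __name__ == "__main__":
    latency = {
        "edge-a": {"edge-a": 1, "edge-b": 8, "edge-c": 12},
        "edge-b": {"edge-a": 8, "edge-b": 1, "edge-c": 6},
        "edge-c": {"edge-a": 12, "edge-b": 6, "edge-c": 1},
    }
    # most requests originate near edge-b, so the replica lands there
    print(place_replica({"edge-a": 5, "edge-b": 40, "edge-c": 10}, latency))  # -> edge-b
```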
{"title":"RPCE: Dynamic Data Replicas Placement Management by Cloud and Edge Collaboration","authors":"Xiaofeng Lu;Luwen Zou;Yilu Mao;Pietro Lio;Pan Hui","doi":"10.1109/TPDS.2025.3627553","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3627553","url":null,"abstract":"With the rapid advancement of information technology, traditional centralized cloud computing systems face challenges in meeting the stringent low-latency demands of emerging applications. To tackle this issue, this paper proposes a delay-aware cloud-edge architecture that incorporates the distributed characteristics of edge infrastructure, enabling low-latency collaboration among edge nodes within the same geographic region. Furthermore, based on this architecture, a dynamic data replica management scheme is introduced, involving synergistic mechanisms between edge nodes and cloud centers to optimally place data replicas on the most suitable edge nodes. The scheme adopts a hierarchical strategy: edge nodes perform short-term localized management of data replicas, while the cloud executes long-term holistic oversight. Experimental results demonstrate that the dynamic approach effectively reduces user access latency, minimizes replica migration frequency, and decreases network bandwidth consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"548-561"},"PeriodicalIF":6.0,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A microcomputer cluster is a group of connected microcomputers that work together to perform as a single system. Unlike high-performance computer clusters, microcomputer clusters are designed to provide reliable and efficient services for safety-critical embedded systems, which usually require low SWaP (Size, Weight, and Power) because of their high stability and cost-control requirements. Considering that safety-critical systems have strict real-time constraints (i.e., deadline constraints) and resource constraints, each microcomputer usually needs to run a Real-Time Operating System (RTOS) instead of Linux to achieve precise scheduling and control, and a high-speed real-time network such as Time-Triggered Ethernet (TTE) is required for intra-cluster communication. In microcomputer clusters, a load imbalance between microcomputers usually leads to system instability, and a lightweight application distribution framework automatically migrates applications among microcomputers, thereby breaking resource isolation and improving resource utilization. However, mainstream application distribution frameworks, such as Kubernetes (K8s), MicroK8s, and K3s, can be applied neither to RTOS nor to TTE. In this study, we design a lightweight application distribution framework with automated and real-time computing and communication (ARC2). ARC2 monitors the microcomputer cluster resource state in real time and introduces a hierarchical resource pooling method to utilize cluster resources flexibly. It employs TTE for application distribution combined with a real-time scheduling strategy, achieving low end-to-end latency and load balancing. It simplifies the existing application distribution framework and introduces low-complexity cluster management logic to achieve low resource overhead. We conduct experimental evaluations on a heterogeneous platform. The results show that: (1) the load imbalance is reduced by at least 59.81% compared to the original system; (2) the deviation in real-time monitoring traffic is reduced by an average of 56.7 ms, with the application distribution success rate reaching 100% and an average distribution time of 393.0 ms; and (3) the CPU, memory, and bandwidth overhead are 9%, 3 MB, and 0.104 Mb/s, respectively.
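As a toy illustration of the load-balancing objective, the sketch below computes an imbalance index over microcomputer CPU loads (coefficient of variation) and performs one migration from the most to the least loaded node when it lowers that index. The metric and the one-application-at-a-time rule are assumptions, not ARC2's scheduling strategy.

```python
# Illustrative sketch only: a load-imbalance metric and a single migration step
# that moves one application when doing so lowers the imbalance.
import statistics

def imbalance(loads):
    mean = statistics.mean(loads.values())
    return statistics.pstdev(loads.values()) / mean if mean else 0.0

def rebalance_once(loads, apps):
    """loads: {node: cpu_load}; apps: {node: [(app_name, cpu_cost), ...]}."""
    hot = max(loads, key=loads.get)
    cold = min(loads, key=loads.get)
    if not apps.get(hot):
        return None
    app, cost = min(apps[hot], key=lambda a: a[1])     # cheapest application to move
    trial = dict(loads)
    trial[hot] -= cost
    trial[cold] += cost
    if imbalance(trial) < imbalance(loads):
        apps[hot].remove((app, cost))
        apps.setdefault(cold, []).append((app, cost))
        loads.update(trial)
        return app, hot, cold
    return None

if __name__ == "__main__":
    loads = {"mc1": 0.9, "mc2": 0.3, "mc3": 0.4}
    apps = {"mc1": [("cam-fusion", 0.3), ("logger", 0.1)], "mc2": [], "mc3": []}
    print(rebalance_once(loads, apps), loads)   # moves "logger" from mc1 to mc2
```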
{"title":"Lightweight Application Distribution With Automated and Real-Time Computing and Communication (ARC2) in Microcomputer Clusters","authors":"Jianchun Luo;Zhongjia Wang;Fei Peng;Xuejun Yu;Dongsheng Wei;Bo Liu;Guoqi Xie","doi":"10.1109/TPDS.2025.3626327","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3626327","url":null,"abstract":"The microcomputer cluster is a group of connected microcomputers that work together to perform as a single system. Unlike high-performance computer clusters, microcomputer clusters are designed to provide reliable and efficient services for safety-critical embedded systems, which usually require low SWaP (Size, Weight, and Power) because of the high stability and cost control requirements. Considering that safety-critical systems have strict real-time constraints (i.e., deadline constraints) and resource constraints, each microcomputer usually needs to run a Real-Time Operating System (RTOS) instead of Linux to achieve precise scheduling and control, and a high-speed real-time network such as Time-Triggered Ethernet (TTE) is required for intra-cluster communication. In microcomputer clusters, a load imbalance between microcomputers usually leads to system instability, and a lightweight application distribution framework automatically migrates applications among microcomputers, thereby breaking resource isolation and improving resource utilization. However, mainstream application distribution frameworks, such as Kubernetes (K8s), MicroK8s, and K3s, can be applied neither to RTOS nor to TTE. In this study, we design a lightweight application distribution framework with automated and real-time computing and communication (ARC2). ARC2 monitors the microcomputer cluster resource state in real-time and introduces a resource hierarchical pooling method to utilize cluster resources flexibly. It employs TTE for application distribution combined with a real-time scheduling strategy, achieving low end-to-end latency and load balancing. It simplifies the existing application distribution framework and introduces a low-complexity cluster management logic to achieve low resource overhead. We conduct experimental evaluations on a heterogeneous platform. The results show that: (1) the load imbalance is reduced by at least 59.81% compared to the original system; (2) the deviation in real-time monitoring traffic is reduced by an average of 56.7 ms, with the application distribution success rate reaching 100% and an average distribution time of 393.0 ms; and (3) the CPU, memory, and bandwidth overhead are 9%, 3 MB, and 0.104 Mb/s, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"169-183"},"PeriodicalIF":6.0,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-28 | DOI: 10.1109/TPDS.2025.3626153
MCG-Sched: Multi-Cluster GPU Scheduling for Resource Fragmentation Reduction and Load Balancing
Haijie Wu;Xinhua Wang;Xiaoxuan Luo;Wangbo Shen;Weiwei Lin
With the rapid development of deep learning (DL) technology, large-scale GPU clusters receive a large number of DL workloads daily. To speed up completion time, a workload usually occupies several GPUs on a server. However, workload scheduling inevitably generates resource fragmentation, which leaves many scattered GPU resources unusable. Existing works improve resource utilization by reducing GPU resource fragmentation, but they focus on scheduling within a single cluster and ignore multi-cluster settings. Multi-cluster scenarios, such as virtual clusters and geo-distributed clusters, require load balancing in addition to high resource utilization, so that some clusters do not exhaust their resources while others sit idle; existing works do not address this well. In this paper, we propose MCG-Sched, a scheduling strategy that reduces resource fragmentation across multiple GPU clusters while maintaining load balancing among them. MCG-Sched measures fragmented resources against the distribution of workload demands and uses a scheme that minimizes fragmentation during workload scheduling. Meanwhile, MCG-Sched achieves balanced scheduling across clusters through a load balancing index. MCG-Sched senses the workload requests in the waiting queue and prioritizes workloads by combining the fragmentation measurement and the load balancing index to maximize resource utilization and load balancing during load peaks. Our experiments show that, compared to existing fragmentation-aware methods, MCG-Sched reduces unallocated GPUs by up to 1.45× and workload waiting time by more than 40%, and achieves effective load balancing.
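The interplay of fragmentation measurement and load balancing can be sketched as a scoring rule: for a job that needs a given number of whole GPUs on one server, each candidate cluster is scored by the fragmented GPUs that placement would leave plus a weighted load term, and the lowest-scoring cluster wins. Both the fragmentation measure and the weighting are assumptions, not MCG-Sched's formulation.

```python
# Illustrative sketch only: choosing a cluster by combining a fragmentation
# measure with a load-balance penalty.
def fragmentation(free_per_server, demand):
    """GPUs that are free but sit on servers with fewer than `demand` free GPUs."""
    return sum(f for f in free_per_server if 0 < f < demand)

def choose_cluster(clusters, demand, lam=0.5):
    """clusters: {name: {"free": [free GPUs per server], "total": int}}; demand: GPUs on one server."""
    def score(name):
        free = sorted(clusters[name]["free"])
        fits = [f for f in free if f >= demand]
        if not fits:
            return float("inf")                   # job does not fit in this cluster
        free.remove(fits[0])                      # best fit: smallest server that fits
        after = free + [fits[0] - demand]
        load = (clusters[name]["total"] - sum(after)) / clusters[name]["total"]
        return fragmentation(after, demand) + lam * load
    return min(clusters, key=score)

if __name__ == "__main__":
    clusters = {
        "east": {"free": [4, 1, 1], "total": 24},
        "west": {"free": [8, 2, 2], "total": 24},
    }
    # the 4-GPU job exactly fills east's 4-GPU server, leaving less fragmentation
    print(choose_cluster(clusters, demand=4))   # -> east
```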
{"title":"MCG-Sched: Multi-Cluster GPU Scheduling for Resource Fragmentation Reduction and Load Balancing","authors":"Haijie Wu;Xinhua Wang;Xiaoxuan Luo;Wangbo Shen;Weiwei Lin","doi":"10.1109/TPDS.2025.3626153","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3626153","url":null,"abstract":"Since the rapid development of deep learning (DL) technology, large-scale GPU clusters receive a large number of DL workloads daily. To speed up the completion time, the workloads usually occupy several GPUs on a server. However, workload scheduling inevitably generates resource fragmentation, which results in many scattered GPU resources being unavailable. Existing works address improving resource utilization by reducing GPU resource fragmentation, while they focus on resource scheduling for a single cluster and ignore multiple clusters. Multi-cluster scenarios, such as virtual clusters and geo-distributed clusters, require load balancing to avoid some clusters exhausting resources while some clusters are idle while improving resource utilization, which is not well addressed by existing works. In this paper, we propose MCG-Sched, a scheduling strategy to reduce resource fragmentation in multiple GPU clusters while maintaining load balancing among clusters. MCG-Sched measures the fragmented resources with the distribution of workload demands and uses a scheme that minimizes fragmentation in workload scheduling. Meanwhile, MCG-Sched achieves balanced load scheduling across clusters through the load balancing index. MCG-Sched senses the workload requests in the waiting queue, and prioritizes the workloads by combining fragmentation measurement and load balancing index to maximize resource utilization and load balancing during load peak. Our experiments show that MCG-Sched reduces unallocated GPUs up to 1.45× and workload waiting time by more than 40% compared to existing fragmentation-aware methods and achieves effective load balancing.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2789-2800"},"PeriodicalIF":6.0,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post-process them for scientific discovery. Unlike image and video compression algorithms, which only limit errors on the primary data (PD), scientists require compression techniques that accurately preserve derived quantities of interest (QoIs). This article presents a physics-informed compression technique, implemented as an end-to-end, scalable, GPU-based pipeline, that addresses this requirement. Our hybrid approach combines machine learning with standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor that provides guarantees on the raw data error, and a constraint-satisfaction post-processing step that preserves the QoIs within a minimal error (generally less than floating-point error). The effectiveness of the pipeline is demonstrated by compressing nuclear fusion simulation data generated by a large-scale fusion code, XGC, which produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and achieves compression by a factor of more than 150 while requiring only a few percent of the computational resources needed to generate the data, making it highly effective in practical scenarios.
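A toy version of the hybrid pipeline's guarantees is sketched below: a crude predictor stands in for the autoencoder, a uniform quantizer on the residual caps the pointwise PD error, and a post-processing shift restores a linear QoI (the field total) essentially exactly. All components are stand-ins chosen for brevity, not the paper's autoencoder, error-bounded coder, or constraint-satisfaction step.

```python
# Illustrative sketch only: predictor + error-bounded residual quantization +
# QoI-restoring post-processing on a 1-D field.
import numpy as np

def compress(data, err_bound=1e-2):
    pred = np.full_like(data, data.mean())            # stand-in for the autoencoder prediction
    q = np.round((data - pred) / (2 * err_bound))     # uniform residual quantization
    return data.mean(), q.astype(np.int32)

def decompress(mean, q, qoi_total, err_bound=1e-2):
    recon = mean + q * (2 * err_bound)                # pointwise error <= err_bound before correction
    # constant shift restores the linear QoI (total) and adds at most err_bound more error per point
    recon += (qoi_total - recon.sum()) / recon.size
    return recon

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    field = rng.normal(size=10_000)
    mean, q = compress(field)
    recon = decompress(mean, q, qoi_total=field.sum())
    print(np.abs(recon - field).max())        # stays below 2 * err_bound
    print(abs(recon.sum() - field.sum()))     # QoI preserved to floating-point precision
```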
{"title":"Scalable Hybrid Learning Techniques for Scientific Data Compression","authors":"Tania Banerjee;Jong Choi;Jaemoon Lee;Qian Gong;Jieyang Chen;Scott Klasky;Anand Rangarajan;Sanjay Ranka","doi":"10.1109/TPDS.2025.3623935","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3623935","url":null,"abstract":"Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post process this data for scientific discovery. Unlike image and video compression algorithms that limit errors to primary data (PD), scientists require compression techniques that accurately preserve derived quantities of interest (QoIs). This article presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression that addresses this requirement. Our hybrid compression technique combines machine learning techniques and standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor to provide guarantees on raw data error, and a constraint satisfaction post-processing step to preserve the QoIs within a minimal error (generally less than floating point error). The effectiveness of the data compression pipeline is demonstrated by compressing nuclear fusion simulation data generated by a large-scale fusion code, XGC, which produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and results in compression by a factor of more than 150 while requiring only a few percent of the computational resources necessary for generating the data, making the overall approach highly effective for practical scenarios.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"29-44"},"PeriodicalIF":6.0,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145546967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}