Resource underutilization has troubled data centers for several decades. On the CPU front, live migration plays a crucial role in reallocating CPU resources. Nevertheless, contemporary Virtual Machine (VM) live migration methods are burdened by substantial resource consumption. In terms of memory management, disaggregated memory offers an effective solution to enhance memory utilization, but leaves a gap in addressing CPU underutilization. Our findings highlight a considerable opportunity to optimize live migration in the context of disaggregated memory systems. We introduce Anemoi, a resource management system that seamlessly integrates VM live migration with memory disaggregation to address the aforementioned gap. In the context of disaggregated memory, remote memory becomes accessible from destination nodes, effectively eliminating the need for extensive network transmission of memory pages, and thereby significantly reducing migration time. In addition, we propose using memory replicas as an optimization to the live migration system. To mitigate the overhead of potential excessive memory consumption, we develop a dedicated compression algorithm. Our evaluations demonstrate that Anemoi leads to a notable 69% reduction in network bandwidth utilization and an impressive 83% reduction in migration time compared to traditional VM live migration. Additionally, our compression algorithm achieves an outstanding space-saving rate of 83.6%.
{"title":"Rethinking Virtual Machines Live Migration for Memory Disaggregation","authors":"Xingzi Yu;Xingguo Jia;Jin Zhang;Yun Wang;Senhao Yu;Zhengwei Qi","doi":"10.1109/TPDS.2025.3597149","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3597149","url":null,"abstract":"Resource underutilization has troubled data centers for several decades. On the CPU front, live migration plays a crucial role in reallocating CPU resources. Nevertheless, contemporary Virtual Machine (VM) live migration methods are burdened by substantial resource consumption. In terms of memory management, disaggregated memory offers an effective solution to enhance memory utilization, but leaves a gap in addressing CPU underutilization. Our findings highlight a considerable opportunity to optimize live migration in the context of disaggregated memory systems. We introduce Anemoi, a resource management system that seamlessly integrates VM live migration with memory disaggregation to address the aforementioned gap. In the context of disaggregated memory, remote memory becomes accessible from destination nodes, effectively eliminating the need for extensive network transmission of memory pages, and thereby significantly reducing migration time. In addition, we propose using memory replicas as an optimization to the live migration system. To mitigate the overhead of potential excessive memory consumption, we develop a dedicated compression algorithm. Our evaluations demonstrate that Anemoi leads to a notable 69% reduction in network bandwidth utilization and an impressive 83% reduction in migration time compared to traditional VM live migration. Additionally, our compression algorithm achieves an outstanding space-saving rate of 83.6%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2310-2324"},"PeriodicalIF":6.0,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-08DOI: 10.1109/TPDS.2025.3597195
Yepeng Zhang;Haitao Zhang;Huadong Ma
Existing CPU scaling approaches have limitations that can lead to inefficient resource allocation and increased penalty costs for tasks with soft deadlines running in container clouds. First, quota allocation based approaches overlook the gap between the obtainable CPU time and allocated quota, causing inefficient CPU utilization and unexpected task behaviors. Second, core allocation based approaches ignore workload dynamics within decision intervals, potentially increasing contention for CPU time among tasks on the same core. Third, existing approaches lack strategies to allocate more resources to critical tasks that incur higher penalty costs when the node’s capacity is insufficient. This article proposes a reinforcement learning based hybrid CPU scaling approach that allocates quota and cores jointly, aiming to minimize penalty costs for timeouts. Based on the embedding generated from a fine-grained CPU demand series, we allocate CPU quotas and determine a dynamic workload-aware core sharing scheme using an attention mechanism that combines respective demands and global criticality regarding penalty costs. Additionally, we integrate the resource gap, CPU time contention, and penalty costs into the reward function to update our model online. The experimental results show the proposed approach achieves state-of-the-art performance.
{"title":"RL-Based Hybrid CPU Scaling for Soft Deadline Constrained Tasks in Container Clouds","authors":"Yepeng Zhang;Haitao Zhang;Huadong Ma","doi":"10.1109/TPDS.2025.3597195","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3597195","url":null,"abstract":"Existing CPU scaling approaches have limitations that can lead to inefficient resource allocation and increased penalty costs for tasks with soft deadlines running in container clouds. First, quota allocation based approaches overlook the gap between the obtainable CPU time and allocated quota, causing inefficient CPU utilization and unexpected task behaviors. Second, core allocation based approaches ignore workload dynamics within decision intervals, potentially increasing contention for CPU time among tasks on the same core. Third, existing approaches lack strategies to allocate more resources to critical tasks that incur higher penalty costs when the node’s capacity is insufficient. This article proposes a reinforcement learning based hybrid CPU scaling approach that allocates quota and cores jointly, aiming to minimize penalty costs for timeouts. Based on the embedding generated from a fine-grained CPU demand series, we allocate CPU quotas and determine a dynamic workload-aware core sharing scheme using an attention mechanism that combines respective demands and global criticality regarding penalty costs. Additionally, we integrate the resource gap, CPU time contention, and penalty costs into the reward function to update our model online. The experimental results show the proposed approach achieves state-of-the-art performance.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2104-2118"},"PeriodicalIF":6.0,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory disaggregation has emerged as a promising architecture for improving resource efficiency by decoupling the computing and memory resources. But building efficient range indices in such an architecture faces three critical challenges: (1) coarse-grained concurrency control schemes for coordinating concurrent read/write operations with node splitting incur high contention under the skewed and write-intensive workloads; (2) existing data layouts fail to balance consistency verification and hardware acceleration via SIMD (Single Instruction Multiple Data); and (3) naive caching schemes struggle to adapt to rapidly changing access patterns. To address these challenges, we propose Mariana, a memory-disaggregated skiplist index that integrates three key innovations. First, it uses a fine-grained (i.e., entry-level) latch mechanism combined with dynamic node resizing to minimize the contention and splitting frequency. Second, it employs a tailored data layout for leaf node, which separates keys and values to enable SIMD acceleration while maintaining consistency checks with minimal write overhead. Third, it implements an adaptive caching strategy that tracks node popularity in real-time to optimize network bandwidth utilization during the index traversal. Experimental results show that Mariana achieves $1.7times$ higher throughput under write-intensive workloads and reduces the P90 latency by 23% under the read-intensive workloads, when comparing to the state-of-the-art indices on disaggregated memory.
{"title":"Mariana: Exploring Native SkipList Index Design for Disaggregated Memory","authors":"Xing Wei;Ke Wang;Yinjun Han;Hao Jin;Yaofeng Tu;Huiqi Hu;Xuan Zhou;Minghao Zhao","doi":"10.1109/TPDS.2025.3596988","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3596988","url":null,"abstract":"Memory disaggregation has emerged as a promising architecture for improving resource efficiency by decoupling the computing and memory resources. But building efficient range indices in such an architecture faces three critical challenges: (1) coarse-grained concurrency control schemes for coordinating concurrent read/write operations with node splitting incur high contention under the skewed and write-intensive workloads; (2) existing data layouts fail to balance consistency verification and hardware acceleration via SIMD (Single Instruction Multiple Data); and (3) naive caching schemes struggle to adapt to rapidly changing access patterns. To address these challenges, we propose <small>Mariana</small>, a memory-disaggregated skiplist index that integrates three key innovations. First, it uses a fine-grained (i.e., entry-level) latch mechanism combined with dynamic node resizing to minimize the contention and splitting frequency. Second, it employs a tailored data layout for leaf node, which separates keys and values to enable SIMD acceleration while maintaining consistency checks with minimal write overhead. Third, it implements an adaptive caching strategy that tracks node popularity in real-time to optimize network bandwidth utilization during the index traversal. Experimental results show that <small>Mariana</small> achieves <inline-formula><tex-math>$1.7times$</tex-math></inline-formula> higher throughput under write-intensive workloads and reduces the P90 latency by 23% under the read-intensive workloads, when comparing to the state-of-the-art indices on disaggregated memory.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2137-2151"},"PeriodicalIF":6.0,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Serverless edge computing is an efficient way to execute event-driven, short-duration, and bursty IoT data processing tasks on resource-limited edge servers, using on-demand resource allocation and dynamic auto-scaling. In this paradigm, function requests are handled in virtualized environments, e.g., containers. When a function request arrives online, if there is no container in memory to execute it, the serverless platform will initialize such a container with non-negligible latency, known as cold start. Otherwise, it results in a warm start with no latency in previous studies. However, based on our experiments, we find there is a remarkable third case called Late-Warm, i.e., when a request arrives during the container initializing, its latency is less than a cold start but not zero. In this paper, we study online container caching in serverless edge computing to minimize the total latency with Late-Warm and other practical issues considered. We propose OnCoLa, a novel $O(T_{c}K)$-competitive algorithm supporting request relaying on multiple edge servers. Here, $T_{c}$ and $K$ are the maximum container cold start latency and the memory size, respectively. Extensive simulations on two real-world traces demonstrate that OnCoLa consistently outperforms the state-of-the-art container caching algorithms and reduces the latency by 23.33%. Experiments on Raspberry Pi and Jetson Nano show that OnCoLa reduces latency by up to 21.38% compared with the representative lightweight policy.
{"title":"Online Container Caching for IoT Data Processing in Serverless Edge Computing","authors":"Guopeng Li;Haisheng Tan;Chi Zhang;Xuan Zhang;Zhenhua Han;Guoliang Chen","doi":"10.1109/TPDS.2025.3595965","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3595965","url":null,"abstract":"Serverless edge computing is an efficient way to execute event-driven, short-duration, and bursty IoT data processing tasks on resource-limited edge servers, using on-demand resource allocation and dynamic auto-scaling. In this paradigm, function requests are handled in virtualized environments, e.g., containers. When a function request arrives online, if there is no container in memory to execute it, the serverless platform will initialize such a container with non-negligible latency, known as cold start. Otherwise, it results in a warm start with no latency in previous studies. However, based on our experiments, we find there is a remarkable third case called Late-Warm, i.e., when a request arrives during the container initializing, its latency is less than a cold start but not zero. In this paper, we study online container caching in serverless edge computing to minimize the total latency with Late-Warm and other practical issues considered. We propose <monospace>OnCoLa</monospace>, a novel <inline-formula><tex-math>$O(T_{c}K)$</tex-math></inline-formula>-competitive algorithm supporting request relaying on multiple edge servers. Here, <inline-formula><tex-math>$T_{c}$</tex-math></inline-formula> and <inline-formula><tex-math>$K$</tex-math></inline-formula> are the maximum container cold start latency and the memory size, respectively. Extensive simulations on two real-world traces demonstrate that <monospace>OnCoLa</monospace> consistently outperforms the state-of-the-art container caching algorithms and reduces the latency by 23.33%. Experiments on Raspberry Pi and Jetson Nano show that <monospace>OnCoLa</monospace> reduces latency by up to 21.38% compared with the representative lightweight policy.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2524-2536"},"PeriodicalIF":6.0,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile Virtual Reality (MVR), which aims to provide high-quality VR services to mobile devices of end users, has become the latest trend in virtual reality developments. The current MVR solution is to remotely render frame data from a cloud server, while the potential of edge computing in MVR is underexploited. In this paper, we propose a new approach named MUCVR to achieve high-quality interactive MVR collaboration for multiple users by exploiting edge computing. First, we design “vertical” edge–cloud collaboration for VR task rendering, in which foreground interaction is offloaded to an edge server for rendering, while the background environment is rendered by the cloud server. Correspondingly, the VR device of a user is only responsible for decoding and displaying. Second, we propose the “horizontal” multi-user collaboration based on edge–edge cooperation, which synchronizes the data among edge servers. Finally, we implement the proposed MUCVR on an MVR device and the Unity VR application engine. The results show that MUCVR can effectively reduce the MVR service latency, improve the rendering performance, reduce the computing load on the VR device, and, ultimately, improve users’ quality of experience.
{"title":"MUCVR: Edge Computing-Enabled High-Quality Multi-User Collaboration for Interactive MVR","authors":"Weimin Li;Qin Li;Weihong Tian;Jie Gao;Fan Wu;Jianxun Liu;Ju Ren","doi":"10.1109/TPDS.2025.3595801","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3595801","url":null,"abstract":"Mobile Virtual Reality (MVR), which aims to provide high-quality VR services to mobile devices of end users, has become the latest trend in virtual reality developments. The current MVR solution is to remotely render frame data from a cloud server, while the potential of edge computing in MVR is underexploited. In this paper, we propose a new approach named MUCVR to achieve high-quality interactive MVR collaboration for multiple users by exploiting edge computing. First, we design “vertical” edge–cloud collaboration for VR task rendering, in which foreground interaction is offloaded to an edge server for rendering, while the background environment is rendered by the cloud server. Correspondingly, the VR device of a user is only responsible for decoding and displaying. Second, we propose the “horizontal” multi-user collaboration based on edge–edge cooperation, which synchronizes the data among edge servers. Finally, we implement the proposed MUCVR on an MVR device and the Unity VR application engine. The results show that MUCVR can effectively reduce the MVR service latency, improve the rendering performance, reduce the computing load on the VR device, and, ultimately, improve users’ quality of experience.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2058-2072"},"PeriodicalIF":6.0,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-01DOI: 10.1109/TPDS.2025.3594694
Yishan Chen;Xiangwei Zeng;Huashuai Cai;Qing Xu;Zhiquan Liu
The application of federated learning (FL) has been widely extended to medical domains, including medical image analysis and health monitoring. With the increasing computation power demand on edge devices, split federated learning has emerged as a promising FL architecture. In this work, a home healthcare monitoring scenario is explored. Unlike existing split federated learning studies that primarily focus on model-level optimization, this study considers a system-level optimization involving latency, packet error rate, and federated training time. Specifically, a k-means algorithm is presented to select inference nodes, participating training clients, and aggregation servers referring to network conditions and data quality. Furthermore, a reinforcement learning method is utilized to allocate the computation and bandwidth resources during inference, training, and aggregation, thereby further improving the quality of service (QoS) and training efficiency. Simulation results demonstrate that the proposed architecture can achieve the target accuracy while offering the enhanced QoS and reduced the FL training time.
{"title":"Decentralized QoS-Aware Model Inference Using Federated Split Learning for Cloud-Edge Medical Detection","authors":"Yishan Chen;Xiangwei Zeng;Huashuai Cai;Qing Xu;Zhiquan Liu","doi":"10.1109/TPDS.2025.3594694","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3594694","url":null,"abstract":"The application of federated learning (FL) has been widely extended to medical domains, including medical image analysis and health monitoring. With the increasing computation power demand on edge devices, split federated learning has emerged as a promising FL architecture. In this work, a home healthcare monitoring scenario is explored. Unlike existing split federated learning studies that primarily focus on model-level optimization, this study considers a system-level optimization involving latency, packet error rate, and federated training time. Specifically, a <italic>k</i>-means algorithm is presented to select inference nodes, participating training clients, and aggregation servers referring to network conditions and data quality. Furthermore, a reinforcement learning method is utilized to allocate the computation and bandwidth resources during inference, training, and aggregation, thereby further improving the quality of service (QoS) and training efficiency. Simulation results demonstrate that the proposed architecture can achieve the target accuracy while offering the enhanced QoS and reduced the FL training time.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2119-2136"},"PeriodicalIF":6.0,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-08-01DOI: 10.1109/TPDS.2025.3594741
Bin Deng;Weidong Li
Multiresource allocation mechanisms have been studied in many scenarios. A new dynamic multiresource fair allocation model with time discount utility is proposed in this article, where users can arrive and depart at different time slots. We propose a new any price share time discount (APS-TD) mechanism for this model, which accounts for the users’ time discount utility while maintaining desirable properties. We prove that the APS-TD mechanism satisfies cumulative incentive sharing (CSI), i.e., that the cumulative utility of each user is not lower than the cumulative utility generated by evenly allocating the available resources in each time slot; cumulative strategyproofness (CSP), where users cannot increase their cumulative utility by falsely reporting their demands in any time slot; cumulative Pareto optimality (CPO), i.e., where no allocation can increase the cumulative utility of one user without reducing the cumulative utility of another user in any time slot; cumulative envy-freeness (CEF), where users who arrive later should not prefer allocations from other users who arrive first in any time slot; time discount share fairness (TDSF), where users with higher time discount values occupy larger resource shares in each time slot unless the utility levels of both users are generated by evenly allocating resources; and bottleneck fairness (BF), where the allocation should satisfy max-min fairness with respect to the bottleneck resources contained in each time slot. We run the APS-TD mechanism on Alibaba trace-driven data to demonstrate the performance enhancement achieved by our proposed mechanism over the existing mechanism extensions. The results show that the APS-TD mechanism is superior to hybrid multiresource fairness (H-MRF) and stateful dominant resource fairness (SDRF) in many ways.
{"title":"Dynamic Multiresource Fair Allocation With Time Discount Utility","authors":"Bin Deng;Weidong Li","doi":"10.1109/TPDS.2025.3594741","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3594741","url":null,"abstract":"Multiresource allocation mechanisms have been studied in many scenarios. A new dynamic multiresource fair allocation model with time discount utility is proposed in this article, where users can arrive and depart at different time slots. We propose a new <italic>any price share</i> time discount (APS-TD) mechanism for this model, which accounts for the users’ time discount utility while maintaining desirable properties. We prove that the APS-TD mechanism satisfies cumulative incentive sharing (CSI), i.e., that the cumulative utility of each user is not lower than the cumulative utility generated by evenly allocating the available resources in each time slot; cumulative strategyproofness (CSP), where users cannot increase their cumulative utility by falsely reporting their demands in any time slot; cumulative Pareto optimality (CPO), i.e., where no allocation can increase the cumulative utility of one user without reducing the cumulative utility of another user in any time slot; cumulative envy-freeness (CEF), where users who arrive later should not prefer allocations from other users who arrive first in any time slot; time discount share fairness (TDSF), where users with higher time discount values occupy larger resource shares in each time slot unless the utility levels of both users are generated by evenly allocating resources; and bottleneck fairness (BF), where the allocation should satisfy max-min fairness with respect to the bottleneck resources contained in each time slot. We run the APS-TD mechanism on Alibaba trace-driven data to demonstrate the performance enhancement achieved by our proposed mechanism over the existing mechanism extensions. The results show that the APS-TD mechanism is superior to hybrid multiresource fairness (H-MRF) and stateful dominant resource fairness (SDRF) in many ways.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2089-2103"},"PeriodicalIF":6.0,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144868146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-31DOI: 10.1109/TPDS.2025.3593896
Zhaorui Zhang;Sheng Di;Benben Liu;Zhuoran Ji;Guanpeng Li;Xiaoyi Lu;Amelie Chi Zhou;Khalid Ayed Alharthi;Jiannong Cao
Cross-Silo federated learning systems have been identified as an efficient approach to scaling DNN training across geographically-distributed data silos to preserve the privacy of the training data. Communication efficiency and fairness are two major issues that need to be both satisfied when federated learning systems are deployed in practice. Simultaneously guaranteeing both of them, however, is exceptionally difficult because simply combining communication reduction and fairness optimization approaches often causes non-converged training or drastic accuracy degradation. To bridge this gap, we propose FedEFsz. On the one hand, it integrates the state-of-the-art error-bounded lossy compressor SZ3 into cross-silo federated learning systems to significantly reduce communication traffic during the training. On the other hand, it achieves a high fairness (i.e., rather consistent model accuracy and performance across different clients) through a carefully designed heuristic algorithm that can tune the error-bound of SZ3 for different clients during the training. Extensive experimental results based on a GPU cluster with 65 GPU cards show that FedEFsz improves the fairness across different benchmarks by up to 60.88% and meanwhile reduces the communication traffic by up to $315times$.
{"title":"FedEFsz: Fair Cross-Silo Federated Learning System With Error-Bounded Lossy Compression","authors":"Zhaorui Zhang;Sheng Di;Benben Liu;Zhuoran Ji;Guanpeng Li;Xiaoyi Lu;Amelie Chi Zhou;Khalid Ayed Alharthi;Jiannong Cao","doi":"10.1109/TPDS.2025.3593896","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3593896","url":null,"abstract":"Cross-Silo federated learning systems have been identified as an efficient approach to scaling DNN training across geographically-distributed data silos to preserve the privacy of the training data. Communication efficiency and fairness are two major issues that need to be both satisfied when federated learning systems are deployed in practice. Simultaneously guaranteeing both of them, however, is exceptionally difficult because simply combining communication reduction and fairness optimization approaches often causes non-converged training or drastic accuracy degradation. To bridge this gap, we propose <i>FedEFsz</i>. On the one hand, it integrates the state-of-the-art error-bounded lossy compressor SZ3 into cross-silo federated learning systems to significantly reduce communication traffic during the training. On the other hand, it achieves a high fairness (i.e., rather consistent model accuracy and performance across different clients) through a carefully designed heuristic algorithm that can tune the error-bound of SZ3 for different clients during the training. Extensive experimental results based on a GPU cluster with 65 GPU cards show that <i>FedEFsz</i> improves the fairness across different benchmarks by up to 60.88% and meanwhile reduces the communication traffic by up to <inline-formula><tex-math>$315times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2482-2496"},"PeriodicalIF":6.0,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-28DOI: 10.1109/TPDS.2025.3593154
Oleksandr Sudakov;Volodymyr Maistrenko
This paper addresses the problem of parallelizing computations to study nonlinear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of nonlinear dynamics models with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, does not require parallel programming from end-users. The runtime scheduler takes into account the performance of computing and communication resources to reduce downtime and to achieve a quasi-optimal parallelizing speed-up. The proposed approach was implemented, and its efficiency is proven by numerous applications for simulating large dynamical networks with 103-108 elements described by Hodgkin–Huxley, FitzHugh–Nagumo, and Kuramoto models, for investigating pathological synchronization during Parkinson’s disease, analyzing multi-stability, for studying chimera and solitary states in 3D networks, etc. All the above computations may be performed using symmetrical multiprocessors, graphic processing units, and a network of workstations within the same run and it was demonstrated that near-linear speed-up can be achieved for large networks. The proposed approach is promising for extension to new hardware like edge-computing devices.
{"title":"Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment","authors":"Oleksandr Sudakov;Volodymyr Maistrenko","doi":"10.1109/TPDS.2025.3593154","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3593154","url":null,"abstract":"This paper addresses the problem of parallelizing computations to study nonlinear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of nonlinear dynamics models with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, does not require parallel programming from end-users. The runtime scheduler takes into account the performance of computing and communication resources to reduce downtime and to achieve a quasi-optimal parallelizing speed-up. The proposed approach was implemented, and its efficiency is proven by numerous applications for simulating large dynamical networks with 10<sup>3</sup>-10<sup>8</sup> elements described by Hodgkin–Huxley, FitzHugh–Nagumo, and Kuramoto models, for investigating pathological synchronization during Parkinson’s disease, analyzing multi-stability, for studying chimera and solitary states in 3D networks, etc. All the above computations may be performed using symmetrical multiprocessors, graphic processing units, and a network of workstations within the same run and it was demonstrated that near-linear speed-up can be achieved for large networks. The proposed approach is promising for extension to new hardware like edge-computing devices.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2030-2044"},"PeriodicalIF":6.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyperconverged Infrastructures (HCIs) combine processing and storage elements to meet the requirements of data-intensive applications in performance, scalability, and quality of service. As an emerging paradigm, HCI should couple with a variety of traditional performance improvement approaches such as I/O caching in virtualized platforms. Contemporary I/O caching schemes are optimized for traditional single-node storage architectures and suffer from two major shortcomings for multi-node architectures: a) imbalanced cache space requirement and b) imbalanced I/O traffic and load. This makes existing schemes inefficient in distributing cache resources over an array of separate physical nodes. In this paper, we propose an Efficient and Load Balanced I/O Cache Architecture (ELICA), managing the solid-state drive (SSD) cache resources across HCI nodes to enhance I/O performance. ELICA dynamically reconfigures and distributes the SSD cache resources throughout the array of HCI nodes and also balances the network traffic and I/O cache load by dynamic reallocation of cache resources. To maximize the performance, we further present an optimization problem defined by Integer Linear Programming to efficiently distribute cache resources and balance the network traffic and I/O cache relocations. Our experimental results on a real platform show that ELICA improves quality of service in terms of average and worst-case latency in HCIs by 3.1× and 23%, respectively, compared to the state-of-the-art.
{"title":"ELICA: Efficient and Load Balanced I/O Cache Architecture for Hyperconverged Infrastructures","authors":"Mostafa Kishani;Sina Ahmadi;Saba Ahmadian;Reza Salkhordeh;Zdenek Becvar;Onur Mutlu;André Brinkmann;Hossein Asadi","doi":"10.1109/TPDS.2025.3592275","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3592275","url":null,"abstract":"<italic>Hyperconverged Infrastructures</i> (HCIs) combine processing and storage elements to meet the requirements of data-intensive applications in performance, scalability, and quality of service. As an emerging paradigm, HCI should couple with a variety of traditional performance improvement approaches such as I/O caching in virtualized platforms. Contemporary I/O caching schemes are optimized for traditional single-node storage architectures and suffer from two major shortcomings for multi-node architectures: a) imbalanced cache space requirement and b) imbalanced I/O traffic and load. This makes existing schemes inefficient in distributing cache resources over an array of separate physical nodes. In this paper, we propose an <italic><u>E</u>fficient and <u>L</u>oad Balanced <u>I</u>/O <u>C</u>ache <u>A</u>rchitecture</i> (ELICA), managing the <italic>solid-state drive</i> (SSD) cache resources across HCI nodes to enhance I/O performance. ELICA dynamically reconfigures and distributes the SSD cache resources throughout the array of HCI nodes and also balances the network traffic and I/O cache load by dynamic reallocation of cache resources. To maximize the performance, we further present an optimization problem defined by <italic>Integer Linear Programming</i> to efficiently distribute cache resources and balance the network traffic and I/O cache relocations. Our experimental results on a real platform show that ELICA improves quality of service in terms of average and worst-case latency in HCIs by 3.1× and 23%, respectively, compared to the state-of-the-art.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2152-2168"},"PeriodicalIF":6.0,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}