Pub Date: 2025-08-04. DOI: 10.1109/TPDS.2025.3595801
Weimin Li;Qin Li;Weihong Tian;Jie Gao;Fan Wu;Jianxun Liu;Ju Ren
Mobile Virtual Reality (MVR), which aims to provide high-quality VR services on the mobile devices of end users, has become a major trend in virtual reality development. Current MVR solutions remotely render frame data on a cloud server, leaving the potential of edge computing in MVR underexploited. In this paper, we propose a new approach, MUCVR, that achieves high-quality interactive MVR collaboration for multiple users by exploiting edge computing. First, we design a “vertical” edge–cloud collaboration for VR rendering tasks, in which foreground interaction is offloaded to an edge server for rendering while the background environment is rendered by the cloud server; the VR device of a user is then responsible only for decoding and display. Second, we propose “horizontal” multi-user collaboration based on edge–edge cooperation, which synchronizes data among edge servers. Finally, we implement MUCVR on an MVR device and the Unity VR application engine. The results show that MUCVR effectively reduces MVR service latency, improves rendering performance, lowers the computing load on the VR device, and, ultimately, improves users’ quality of experience.
{"title":"MUCVR: Edge Computing-Enabled High-Quality Multi-User Collaboration for Interactive MVR","authors":"Weimin Li;Qin Li;Weihong Tian;Jie Gao;Fan Wu;Jianxun Liu;Ju Ren","doi":"10.1109/TPDS.2025.3595801","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3595801","url":null,"abstract":"Mobile Virtual Reality (MVR), which aims to provide high-quality VR services to mobile devices of end users, has become the latest trend in virtual reality developments. The current MVR solution is to remotely render frame data from a cloud server, while the potential of edge computing in MVR is underexploited. In this paper, we propose a new approach named MUCVR to achieve high-quality interactive MVR collaboration for multiple users by exploiting edge computing. First, we design “vertical” edge–cloud collaboration for VR task rendering, in which foreground interaction is offloaded to an edge server for rendering, while the background environment is rendered by the cloud server. Correspondingly, the VR device of a user is only responsible for decoding and displaying. Second, we propose the “horizontal” multi-user collaboration based on edge–edge cooperation, which synchronizes the data among edge servers. Finally, we implement the proposed MUCVR on an MVR device and the Unity VR application engine. The results show that MUCVR can effectively reduce the MVR service latency, improve the rendering performance, reduce the computing load on the VR device, and, ultimately, improve users’ quality of experience.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2058-2072"},"PeriodicalIF":6.0,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144843117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-01. DOI: 10.1109/TPDS.2025.3594694
Yishan Chen;Xiangwei Zeng;Huashuai Cai;Qing Xu;Zhiquan Liu
The application of federated learning (FL) has been widely extended to medical domains, including medical image analysis and health monitoring. With the growing demand for computation power on edge devices, split federated learning has emerged as a promising FL architecture. This work explores a home healthcare monitoring scenario. Unlike existing split federated learning studies that focus primarily on model-level optimization, it considers system-level optimization involving latency, packet error rate, and federated training time. Specifically, a k-means algorithm selects inference nodes, participating training clients, and aggregation servers according to network conditions and data quality. Furthermore, a reinforcement learning method allocates computation and bandwidth resources during inference, training, and aggregation, further improving quality of service (QoS) and training efficiency. Simulation results demonstrate that the proposed architecture achieves the target accuracy while enhancing QoS and reducing FL training time.
{"title":"Decentralized QoS-Aware Model Inference Using Federated Split Learning for Cloud-Edge Medical Detection","authors":"Yishan Chen;Xiangwei Zeng;Huashuai Cai;Qing Xu;Zhiquan Liu","doi":"10.1109/TPDS.2025.3594694","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3594694","url":null,"abstract":"The application of federated learning (FL) has been widely extended to medical domains, including medical image analysis and health monitoring. With the increasing computation power demand on edge devices, split federated learning has emerged as a promising FL architecture. In this work, a home healthcare monitoring scenario is explored. Unlike existing split federated learning studies that primarily focus on model-level optimization, this study considers a system-level optimization involving latency, packet error rate, and federated training time. Specifically, a <italic>k</i>-means algorithm is presented to select inference nodes, participating training clients, and aggregation servers referring to network conditions and data quality. Furthermore, a reinforcement learning method is utilized to allocate the computation and bandwidth resources during inference, training, and aggregation, thereby further improving the quality of service (QoS) and training efficiency. Simulation results demonstrate that the proposed architecture can achieve the target accuracy while offering the enhanced QoS and reduced the FL training time.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2119-2136"},"PeriodicalIF":6.0,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144904887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-01. DOI: 10.1109/TPDS.2025.3594741
Bin Deng;Weidong Li
Multiresource allocation mechanisms have been studied in many scenarios. This article proposes a new dynamic multiresource fair allocation model with time discount utility, in which users can arrive and depart at different time slots. We propose a new any price share time discount (APS-TD) mechanism for this model, which accounts for users’ time discount utility while maintaining desirable properties. We prove that the APS-TD mechanism satisfies cumulative sharing incentive (CSI), i.e., the cumulative utility of each user is at least the cumulative utility generated by evenly allocating the available resources in each time slot; cumulative strategyproofness (CSP), where users cannot increase their cumulative utility by misreporting their demands in any time slot; cumulative Pareto optimality (CPO), where no allocation can increase the cumulative utility of one user without reducing that of another in any time slot; cumulative envy-freeness (CEF), where users who arrive later do not prefer the allocations of users who arrived earlier in any time slot; time discount share fairness (TDSF), where users with higher time discount values occupy larger resource shares in each time slot unless both users’ utility levels are those generated by evenly allocating resources; and bottleneck fairness (BF), where the allocation satisfies max-min fairness with respect to the bottleneck resources in each time slot. We run the APS-TD mechanism on Alibaba trace-driven data to demonstrate the performance gains of our mechanism over existing mechanism extensions. The results show that the APS-TD mechanism outperforms hybrid multiresource fairness (H-MRF) and stateful dominant resource fairness (SDRF) in multiple respects.
{"title":"Dynamic Multiresource Fair Allocation With Time Discount Utility","authors":"Bin Deng;Weidong Li","doi":"10.1109/TPDS.2025.3594741","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3594741","url":null,"abstract":"Multiresource allocation mechanisms have been studied in many scenarios. A new dynamic multiresource fair allocation model with time discount utility is proposed in this article, where users can arrive and depart at different time slots. We propose a new <italic>any price share</i> time discount (APS-TD) mechanism for this model, which accounts for the users’ time discount utility while maintaining desirable properties. We prove that the APS-TD mechanism satisfies cumulative incentive sharing (CSI), i.e., that the cumulative utility of each user is not lower than the cumulative utility generated by evenly allocating the available resources in each time slot; cumulative strategyproofness (CSP), where users cannot increase their cumulative utility by falsely reporting their demands in any time slot; cumulative Pareto optimality (CPO), i.e., where no allocation can increase the cumulative utility of one user without reducing the cumulative utility of another user in any time slot; cumulative envy-freeness (CEF), where users who arrive later should not prefer allocations from other users who arrive first in any time slot; time discount share fairness (TDSF), where users with higher time discount values occupy larger resource shares in each time slot unless the utility levels of both users are generated by evenly allocating resources; and bottleneck fairness (BF), where the allocation should satisfy max-min fairness with respect to the bottleneck resources contained in each time slot. We run the APS-TD mechanism on Alibaba trace-driven data to demonstrate the performance enhancement achieved by our proposed mechanism over the existing mechanism extensions. The results show that the APS-TD mechanism is superior to hybrid multiresource fairness (H-MRF) and stateful dominant resource fairness (SDRF) in many ways.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2089-2103"},"PeriodicalIF":6.0,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144868146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-31. DOI: 10.1109/TPDS.2025.3593896
Zhaorui Zhang;Sheng Di;Benben Liu;Zhuoran Ji;Guanpeng Li;Xiaoyi Lu;Amelie Chi Zhou;Khalid Ayed Alharthi;Jiannong Cao
Cross-silo federated learning systems have been identified as an efficient approach to scaling DNN training across geographically distributed data silos while preserving the privacy of the training data. Communication efficiency and fairness are two major requirements that must both be satisfied when federated learning systems are deployed in practice. Guaranteeing both simultaneously, however, is exceptionally difficult, because naively combining communication reduction and fairness optimization approaches often causes non-converged training or drastic accuracy degradation. To bridge this gap, we propose FedEFsz. On the one hand, it integrates the state-of-the-art error-bounded lossy compressor SZ3 into cross-silo federated learning systems to significantly reduce communication traffic during training. On the other hand, it achieves high fairness (i.e., consistent model accuracy and performance across clients) through a carefully designed heuristic algorithm that tunes the error bound of SZ3 for different clients during training. Extensive experimental results on a GPU cluster with 65 GPU cards show that FedEFsz improves fairness across benchmarks by up to 60.88% while reducing communication traffic by up to $315\times$.
{"title":"FedEFsz: Fair Cross-Silo Federated Learning System With Error-Bounded Lossy Compression","authors":"Zhaorui Zhang;Sheng Di;Benben Liu;Zhuoran Ji;Guanpeng Li;Xiaoyi Lu;Amelie Chi Zhou;Khalid Ayed Alharthi;Jiannong Cao","doi":"10.1109/TPDS.2025.3593896","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3593896","url":null,"abstract":"Cross-Silo federated learning systems have been identified as an efficient approach to scaling DNN training across geographically-distributed data silos to preserve the privacy of the training data. Communication efficiency and fairness are two major issues that need to be both satisfied when federated learning systems are deployed in practice. Simultaneously guaranteeing both of them, however, is exceptionally difficult because simply combining communication reduction and fairness optimization approaches often causes non-converged training or drastic accuracy degradation. To bridge this gap, we propose <i>FedEFsz</i>. On the one hand, it integrates the state-of-the-art error-bounded lossy compressor SZ3 into cross-silo federated learning systems to significantly reduce communication traffic during the training. On the other hand, it achieves a high fairness (i.e., rather consistent model accuracy and performance across different clients) through a carefully designed heuristic algorithm that can tune the error-bound of SZ3 for different clients during the training. Extensive experimental results based on a GPU cluster with 65 GPU cards show that <i>FedEFsz</i> improves the fairness across different benchmarks by up to 60.88% and meanwhile reduces the communication traffic by up to <inline-formula><tex-math>$315times$</tex-math></inline-formula>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2482-2496"},"PeriodicalIF":6.0,"publicationDate":"2025-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-28. DOI: 10.1109/TPDS.2025.3593154
Oleksandr Sudakov;Volodymyr Maistrenko
This paper addresses the problem of parallelizing computations to study nonlinear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of nonlinear dynamics models, with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, requires no parallel programming from end users. The runtime scheduler takes the performance of computing and communication resources into account to reduce downtime and achieve quasi-optimal parallelization speed-up. The proposed approach was implemented, and its efficiency is demonstrated by numerous applications: simulating large dynamical networks with $10^{3}$–$10^{8}$ elements described by Hodgkin–Huxley, FitzHugh–Nagumo, and Kuramoto models; investigating pathological synchronization during Parkinson’s disease; analyzing multi-stability; and studying chimera and solitary states in 3D networks. All of these computations may be performed using symmetric multiprocessors, graphics processing units, and a network of workstations within the same run, and near-linear speed-up was demonstrated for large networks. The approach is promising for extension to new hardware such as edge-computing devices.
{"title":"Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment","authors":"Oleksandr Sudakov;Volodymyr Maistrenko","doi":"10.1109/TPDS.2025.3593154","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3593154","url":null,"abstract":"This paper addresses the problem of parallelizing computations to study nonlinear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of nonlinear dynamics models with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, does not require parallel programming from end-users. The runtime scheduler takes into account the performance of computing and communication resources to reduce downtime and to achieve a quasi-optimal parallelizing speed-up. The proposed approach was implemented, and its efficiency is proven by numerous applications for simulating large dynamical networks with 10<sup>3</sup>-10<sup>8</sup> elements described by Hodgkin–Huxley, FitzHugh–Nagumo, and Kuramoto models, for investigating pathological synchronization during Parkinson’s disease, analyzing multi-stability, for studying chimera and solitary states in 3D networks, etc. All the above computations may be performed using symmetrical multiprocessors, graphic processing units, and a network of workstations within the same run and it was demonstrated that near-linear speed-up can be achieved for large networks. The proposed approach is promising for extension to new hardware like edge-computing devices.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2030-2044"},"PeriodicalIF":6.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-24. DOI: 10.1109/TPDS.2025.3592275
Mostafa Kishani;Sina Ahmadi;Saba Ahmadian;Reza Salkhordeh;Zdenek Becvar;Onur Mutlu;André Brinkmann;Hossein Asadi
Hyperconverged Infrastructures (HCIs) combine processing and storage elements to meet the requirements of data-intensive applications in performance, scalability, and quality of service. As an emerging paradigm, HCI must be coupled with a variety of traditional performance improvement approaches, such as I/O caching in virtualized platforms. Contemporary I/O caching schemes are optimized for traditional single-node storage architectures and suffer from two major shortcomings in multi-node architectures: a) imbalanced cache space requirements and b) imbalanced I/O traffic and load. This makes existing schemes inefficient at distributing cache resources over an array of separate physical nodes. In this paper, we propose an Efficient and Load Balanced I/O Cache Architecture (ELICA) that manages solid-state drive (SSD) cache resources across HCI nodes to enhance I/O performance. ELICA dynamically reconfigures and distributes SSD cache resources throughout the array of HCI nodes and balances network traffic and I/O cache load through dynamic reallocation of cache resources. To maximize performance, we further present an optimization problem, formulated as an Integer Linear Program, that efficiently distributes cache resources while balancing network traffic and I/O cache relocations. Our experimental results on a real platform show that ELICA improves quality of service in terms of average and worst-case latency in HCIs by 3.1× and 23%, respectively, compared to the state of the art.
{"title":"ELICA: Efficient and Load Balanced I/O Cache Architecture for Hyperconverged Infrastructures","authors":"Mostafa Kishani;Sina Ahmadi;Saba Ahmadian;Reza Salkhordeh;Zdenek Becvar;Onur Mutlu;André Brinkmann;Hossein Asadi","doi":"10.1109/TPDS.2025.3592275","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3592275","url":null,"abstract":"<italic>Hyperconverged Infrastructures</i> (HCIs) combine processing and storage elements to meet the requirements of data-intensive applications in performance, scalability, and quality of service. As an emerging paradigm, HCI should couple with a variety of traditional performance improvement approaches such as I/O caching in virtualized platforms. Contemporary I/O caching schemes are optimized for traditional single-node storage architectures and suffer from two major shortcomings for multi-node architectures: a) imbalanced cache space requirement and b) imbalanced I/O traffic and load. This makes existing schemes inefficient in distributing cache resources over an array of separate physical nodes. In this paper, we propose an <italic><u>E</u>fficient and <u>L</u>oad Balanced <u>I</u>/O <u>C</u>ache <u>A</u>rchitecture</i> (ELICA), managing the <italic>solid-state drive</i> (SSD) cache resources across HCI nodes to enhance I/O performance. ELICA dynamically reconfigures and distributes the SSD cache resources throughout the array of HCI nodes and also balances the network traffic and I/O cache load by dynamic reallocation of cache resources. To maximize the performance, we further present an optimization problem defined by <italic>Integer Linear Programming</i> to efficiently distribute cache resources and balance the network traffic and I/O cache relocations. Our experimental results on a real platform show that ELICA improves quality of service in terms of average and worst-case latency in HCIs by 3.1× and 23%, respectively, compared to the state-of-the-art.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2152-2168"},"PeriodicalIF":6.0,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-22. DOI: 10.1109/TPDS.2025.3591452
Giulio Malenza;Valentina Cesare;Marco Edoardo Santimaria;Robert Birke;Alberto Vecchiato;Ugo Becciani;Marco Aldinucci
Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOP computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with $O(10^{3})$ nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks that are portable across diverse CPU and GPU architectures without performance loss is increasingly compelling. We investigate the performance portability (PP) of a real-world application: the solver module of the AVU–GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of $\sim 10^{8}$ stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to six further GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework’s performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency. Architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms, with larger problem sizes yielding better PP scores due to higher GPU occupancies.
{"title":"Performance Portability Assessment in Gaia","authors":"Giulio Malenza;Valentina Cesare;Marco Edoardo Santimaria;Robert Birke;Alberto Vecchiato;Ugo Becciani;Marco Aldinucci","doi":"10.1109/TPDS.2025.3591452","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3591452","url":null,"abstract":"Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOPs computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with <inline-formula><tex-math>$O(10^{3})$</tex-math></inline-formula> nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks portable across diverse CPU and GPU architectures without performance losses is increasingly compelling. We investigate the performance portability (<inline-graphic>) of a real-world application: the solver module of the AVU–GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of <inline-formula><tex-math>${sim} 10^{8}$</tex-math></inline-formula> stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to further six GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework’s performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency. Architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms with larger problem sizes providing better <inline-graphic> scores due to higher GPU occupancies.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2045-2057"},"PeriodicalIF":6.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11090032","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-21. DOI: 10.1109/TPDS.2025.3591010
Yujian Wu;Shanjiang Tang;Ce Yu;Bin Yang;Chao Sun;Jian Xiao;Hutong Wu;Jinghua Feng
Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands grow rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge. The challenge arises from the inherent characteristics of geographic distribution, including heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming at objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis of these approaches based on their core scheduling objectives. Through our analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing.
{"title":"Task Scheduling in Geo-Distributed Computing: A Survey","authors":"Yujian Wu;Shanjiang Tang;Ce Yu;Bin Yang;Chao Sun;Jian Xiao;Hutong Wu;Jinghua Feng","doi":"10.1109/TPDS.2025.3591010","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3591010","url":null,"abstract":"Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputer computing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands increase rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge. It arises from the inherent characteristics of geographic distribution, including heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming to achieve objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis of these approaches based on their core scheduling objectives. Through our analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2073-2088"},"PeriodicalIF":6.0,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144867934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-18. DOI: 10.1109/TPDS.2025.3590368
Xishuo Li;Shan Zhang;Tie Ma;Zhiyuan Wang;Hongbin Luo
In decentralized edge computing environments, user devices need to perceive the status of neighboring devices, including computational availability and communication delays, to optimize task offloading decisions. However, probing the real-time status of all devices introduces significant overhead, while probing only a few devices can lead to suboptimal decisions, given the massive connectivity and non-stationarity of edge networks. Aiming to balance status probing cost and task offloading performance, we study the joint transmission and computation status probing problem, where the status and offloading delay of edge devices follow general, bounded, and non-stationary distributions. The problem is proved to be NP-hard even with known offloading delay distributions. For this case, we design an efficient offline method that guarantees a $(1-1/e)$ approximation ratio by leveraging the submodularity of the expected offloading delay function. Furthermore, for scenarios with unknown and non-stationary offloading delay distributions, we reformulate the problem using the piecewise-stationary combinatorial multi-armed bandit framework and develop a change-point detection-based online status probing (CD-OSP) algorithm. CD-OSP detects environmental changes in a timely manner and updates probing strategies using the proposed offline method and estimated offloading delay distributions. We prove that CD-OSP achieves a regret of $\mathcal{O}(NV\sqrt{T\ln T})$, where $N$, $V$, and $T$ denote the numbers of stationary periods, edge devices, and time slots, respectively. Extensive simulations and testbed experiments demonstrate that CD-OSP significantly outperforms state-of-the-art baselines, reducing the probing cost by up to 16.18× with a 2.14× increase in offloading delay.
{"title":"Doing More With Less: Balancing Probing Costs and Task Offloading Efficiency At the Network Edge","authors":"Xishuo Li;Shan Zhang;Tie Ma;Zhiyuan Wang;Hongbin Luo","doi":"10.1109/TPDS.2025.3590368","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3590368","url":null,"abstract":"In decentralized edge computing environments, user devices need to perceive the status of neighboring devices, including computational availability and communication delays, to optimize task offloading decisions. However, probing the real-time status of all devices introduces significant overhead, and probing only a few devices can lead to suboptimal decision-making, considering the massive connectivity and non-stationarity of edge networks. Aiming to balance the status probing cost and task offloading performance, we study the joint transmission and computation status probing problem, where the status and offloading delay on edge devices are characterized by general, bounded, and non-stationary distributions. The problem is proved to be NP-hard, even with known offloading delay distributions. To handle this case, we design an efficient offline method that guarantees a <inline-formula><tex-math>$(1-1/e)$</tex-math></inline-formula> approximation ratio via leveraging the submodularity of the expected offloading delay function. Furthermore, for scenarios with unknown and non-stationary offloading delay distributions, we reformulate the problem using the piecewise-stationary combinatorial multi-armed bandit framework and develop a change-point detection-based online status probing (CD-OSP) algorithm. CD-OSP can timely detect environmental changes and update probing strategies via using the proposed offline method and estimating offloading delay distributions. We prove that CD-OSP achieves a regret of <inline-formula><tex-math>$mathcal {O}(NVsqrt{Tln T})$</tex-math></inline-formula>, with <inline-formula><tex-math>$N$</tex-math></inline-formula>, <inline-formula><tex-math>$V$</tex-math></inline-formula>, and <inline-formula><tex-math>$T$</tex-math></inline-formula> denoting the numbers of stationary periods, edge devices, and time slots, respectively. Extensive simulations and testbed experiments demonstrate that CD-OSP significantly outperforms state-of-the-art baselines, which can reduce the probing cost by up to 16.18X with a 2.14X increase in the offloading delay.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2247-2263"},"PeriodicalIF":6.0,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-17. DOI: 10.1109/TPDS.2025.3590014
Ruidong Zhu;Ziyue Jiang;Zhi Zhang;Xin Liu;Xuanzhe Liu;Xin Jin
Low-rank adaptation (LoRA) is widely used to efficiently fine-tune large language models (LLMs), leading to multiple models fine-tuned from the same pre-trained LLM. State-of-the-art LLM serving systems colocate these LoRA models on the same GPU instances for concurrent serving, which decreases memory usage and boosts efficiency. However, unawareness of each LoRA service’s SLO requirements and interference between requests from different LoRA services can cause significant SLO violations. This paper presents Cannikin, a multi-LoRA inference serving system that optimizes the minimum SLO attainment over all LoRA services in the system, denoted the lagger-SLO attainment. We draw insights from the characterization of a real-world multi-LoRA serving trace, which reveals stable input/output lengths for the most popular LoRA services. This motivates Cannikin’s SLO-aware scheduling algorithm, which prioritizes requests based on efficient deadline estimation. Cannikin further detects the influence of interference between different LoRA services on SLO violations and eliminates the bias between these services. Evaluation on real-world traces demonstrates that, compared to state-of-the-art multi-LoRA serving systems, Cannikin can handle up to 3.6× higher request rates or 2.8× more burstiness while keeping the SLO attainment of each LoRA service above 90%.
{"title":"Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving","authors":"Ruidong Zhu;Ziyue Jiang;Zhi Zhang;Xin Liu;Xuanzhe Liu;Xin Jin","doi":"10.1109/TPDS.2025.3590014","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3590014","url":null,"abstract":"Low-rank adaptation (LoRA) is widely used to efficiently fine-tune large language models (LLMs), leading to multiple models fine-tuned from the same pre-trained LLM. State-of-the-art LLM serving systems colocate these LoRA models on the same GPU instances for concurrent serving, which decreases memory usage and boosts efficiency. However, the unawareness of the SLO requirements of each LoRA service and the interference between requests from different LoRA services can cause significant SLO violations. This paper presents Cannikin, a multi-LoRA inference serving system that optimizes the minimum of the SLO attainments of all LoRA services in the serving system, denoted as lagger-SLO attainment. We obtain insights from the characterization of a real-world multi-LoRA serving trace, which reveals the stable input/output lengths of the most popular LoRA services. This motivates Cannikin to propose an SLO-aware scheduling algorithm that prioritizes requests based on efficient deadline estimation. Cannikin further detects the influence of interference between different LoRA services on SLO violations and eliminates the bias between these services. The evaluation using real-world traces demonstrates that compared to the state-of-the-art multi-LoRA serving systems, Cannikin can handle up to 3.6× higher rates or 2.8× more burstiness while maintaining the SLO attainment of each LoRA service <inline-formula><tex-math>$> $</tex-math></inline-formula> 90% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 9","pages":"1972-1984"},"PeriodicalIF":6.0,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}