Pub Date: 2024-11-28. DOI: 10.1109/TPDS.2024.3508275
Xiaodong Dong;Lihai Nie;Zheli Liu;Yang Xiang
Inter-datacenter network applications generate massive coflows with deadline requirements for purposes such as backup, synchronization, and analytics. Decentralized coflow scheduling frameworks are desirable for their scalability in cross-domain deployment, but they grapple with information agnosticism because they lack cross-domain privileges. Current information-agnostic coflow scheduling methods are incompatible with decentralized frameworks because they rely on centralized controllers to continuously monitor and learn from global coflow transmission states in order to infer global coflow information. Alternative methods propose mechanisms for decentralized gathering and synchronization of global coflow information. However, they require dedicated physical hardware or control logic, which could be impractical for incremental deployment. This article proposes Slark, a decentralized deadline-aware coflow scheduling framework that meets coflows’ soft and hard deadline requirements using only local traffic information. Slark avoids the need for global coflow transmission states and for dedicated hardware or control logic by leveraging multiple software-implemented scheduling agents that work independently on each node. It integrates information agnosticism into node-specific bandwidth allocation by modeling the allocation as a robust optimization problem in which flow information on the other nodes is represented as uncertain parameters. Subsequently, we validate the performance robustness of Slark by investigating how uncertain parameters perturb the optimal objective function value and the associated optimal solution. Finally, we propose a firebug-swarm-optimization-based heuristic algorithm to tackle the non-convexity in our problem. Experimental results demonstrate that Slark can significantly enhance transmission revenue and increase soft and hard deadline guarantee ratios by 10.52% and 7.99% on average.
Title: Slark: A Performance Robust Decentralized Inter-Datacenter Deadline-Aware Coflows Scheduling Framework With Local Information. Journal: IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 197-211.
Pub Date: 2024-11-27. DOI: 10.1109/TPDS.2024.3506588
Zhi Ling;Xiaofeng Jiang;Xiaobin Tan;Huasen He;Shiyin Zhu;Jian Yang
Distributed training of deep neural networks (DNNs) suffers from efficiency declines in dynamic heterogeneous environments, due to the resource wastage caused by the straggler problem in data parallelism (DP) and by pipeline bubbles in model parallelism (MP). Additionally, limited resource availability requires a trade-off between training performance and long-term costs, particularly in online settings. To address these challenges, this article presents a novel online approach that maximizes long-term training efficiency in heterogeneous environments through uneven data assignment and communication-aware model partitioning. A group-based hierarchical architecture combining DP and MP is developed to balance discrepant computation and communication capabilities and to offer a flexible parallel mechanism. To jointly optimize the performance and long-term cost of the online DL training process, we formulate the problem as a stochastic optimization with time-averaged constraints. Using Lyapunov’s stochastic network optimization theory, we decompose it into several instantaneous sub-optimizations and devise an effective online solution to them based on tentative searching and linear solving. We implemented a prototype system and evaluated our solution in realistic experiments, where it reduces batch training time by up to 68.59% over state-of-the-art methods.
Title: Joint Dynamic Data and Model Parallelism for Distributed Training of DNNs Over Heterogeneous Infrastructure. Journal: IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 150-167.
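The abstract above describes decomposing a stochastic problem with time-averaged constraints into per-slot sub-problems. The sketch below illustrates the generic drift-plus-penalty pattern from Lyapunov optimization that such decompositions commonly follow. It is not the authors' algorithm: the function name, the `(batch_time, cost)` option tuples, the per-slot `budget`, and the trade-off weight `V` are all hypothetical, chosen only for illustration.

```python
def drift_plus_penalty(options, budget, V, steps):
    """Generic drift-plus-penalty loop (illustrative assumption, not the paper's method).

    options: list of hypothetical (batch_time, cost) choices available in every slot.
    budget:  target for the time-averaged cost.
    V:       weight trading off batch time (penalty) against constraint violation.
    Returns (average batch time, average cost, final virtual queue length).
    """
    queue = 0.0  # virtual queue: accumulated cost in excess of the budget
    total_time = 0.0
    total_cost = 0.0
    for _ in range(steps):
        # Instantaneous sub-problem: minimize V * penalty + queue * cost.
        batch_time, cost = min(options, key=lambda o: V * o[0] + queue * o[1])
        # Queue update: grows when cost exceeds the budget, never goes negative.
        queue = max(queue + cost - budget, 0.0)
        total_time += batch_time
        total_cost += cost
    return total_time / steps, total_cost / steps, queue


# Hypothetical example: a fast but expensive option and a slow but cheap one.
avg_time, avg_cost, final_queue = drift_plus_penalty(
    options=[(1.0, 3.0), (2.0, 1.0)], budget=2.0, V=10.0, steps=1000
)
```

Because the queue update never lets the queue fall below the running excess, the average cost is at most `budget + final_queue / steps`. A larger `V` favors shorter batch times at the price of a longer virtual queue, which is the usual trade-off in this family of methods.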
Pub Date: 2024-11-25. DOI: 10.1109/TPDS.2024.3506625
Pengwei Wang;Junye Qiao;Yuying Zhao;Zhijun Ding
Edge storage offers low-latency services to users. However, because edge resources are strained and costly, enterprises must choose the data that most warrant placement at the edge and place them in the right location. In practice, data exhibit temporal and spatial properties as well as variability, all of which significantly affect placement but have been largely ignored in research. To address this, we introduce the concept of data temperature, which considers data characteristics over time and space. To account for spatial relevance among regions when placing data, we present a PageRank-inspired model that uses data temperature to assess the regional value of data and effectively leverages collaboration within the edge storage system. We also propose a regional value-based algorithm (RVA) that minimizes cost while meeting user response time requirements. By taking into account the correlation between regions, the RVA can achieve lower latency than current methods while creating an equal or even smaller number of replicas. Experimental results validate the efficacy of the proposed method in terms of latency, success rate, and cost efficiency.
Title: Cost-Effective and Low-Latency Data Placement in Edge Environment Based on PageRank-Inspired Regional Value. Journal: IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 185-196.
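The abstract above says the regional value model is inspired by PageRank but does not give its equations. For orientation only, the sketch below shows plain personalized PageRank computed by power iteration over a small graph of regions. Using a per-region `temperature` as the restart distribution, as well as the function name and the adjacency-matrix input, is an assumption made here for illustration and should not be read as the paper's model.

```python
def regional_scores(adj, temperature, damping=0.85, iters=100):
    """Personalized PageRank by power iteration (illustrative assumption).

    adj:         adj[i][j] is a non-negative link weight from region i to region j.
    temperature: hypothetical non-negative per-region weights, used as the
                 restart (personalization) distribution after normalization.
    Returns a list of scores that sums to 1.
    """
    n = len(adj)
    total = sum(temperature)
    base = [t / total for t in temperature]
    score = base[:]
    for _ in range(iters):
        # Restart term: a fraction (1 - damping) returns to the base distribution.
        nxt = [(1.0 - damping) * b for b in base]
        for i, row in enumerate(adj):
            out = sum(row)
            for j in range(n):
                # Regions without outgoing links spread their score via `base`.
                share = row[j] / out if out > 0 else base[j]
                nxt[j] += damping * score[i] * share
        score = nxt
    return score


# Hypothetical three-region example: region 0 is linked with regions 1 and 2.
scores = regional_scores(
    adj=[[0, 1, 1], [1, 0, 0], [1, 0, 0]], temperature=[1.0, 1.0, 1.0]
)
```

In this toy graph the hub region receives the highest score and the two symmetric regions receive equal scores, which is the kind of cross-region influence a PageRank-style measure is meant to capture.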
Pub Date: 2024-11-18. DOI: 10.1109/TPDS.2024.3501581
Jialiang Han;Yudong Han;Xiang Jing;Gang Huang;Yun Ma
Federated learning (FL) is an emerging and promising paradigm of privacy-preserving machine learning (ML). An important type of FL is cross-silo FL, which enables a moderate number of organizations to cooperatively train a shared model by keeping confidential data local and aggregating gradients on a central parameter server. However, the central server may be vulnerable to malicious attacks or software failures in practice. To address this issue, in this paper, we propose $\mathtt{DegaFL}$