Title: Straggler mitigation via hierarchical scheduling in elastic stream computing systems
Authors: Minghui Wu, Dawei Sun, Shang Gao, Rajkumar Buyya
Journal: Future Generation Computer Systems-The International Journal of Escience, volume 166, Article 107673
DOI: 10.1016/j.future.2024.107673
Publication date: 2024-12-14
Publication type: Journal Article
Impact factor: 6.2 (JCR Q1, COMPUTER SCIENCE, THEORY & METHODS)
Open access: no
URL: https://www.sciencedirect.com/science/article/pii/S0167739X2400637X
Citations: 0
Abstract
Skewed data distribution causes certain tasks or nodes to handle much more data than others, which slows their execution and turns them into stragglers. Existing solutions attempt to establish a well-balanced workload to mitigate stragglers by using either data stream grouping or task scheduling. This “one size fits all” approach considers requirements at only a single level and fails to address the diverse needs of the system across multiple levels, ultimately limiting its performance. To address these issues and mitigate stragglers effectively, we propose a hierarchical collaborative strategy called Ms-Stream. It aims to balance the data stream workloads among tasks and to keep the load difference among compute nodes within an acceptable range. This paper discusses the strategy from the following aspects: (1) Ms-Stream constructs models for topology, grouping, and resources, and formalizes the problems of data stream grouping, task subgraph partitioning, and task deployment. (2) Ms-Stream employs a lightweight two-level grouping method to support dynamic workload assignment for stateful tasks, selectively offloading resources from task stragglers to others. (3) Ms-Stream allocates communication-intensive tasks to the same group through the directed acyclic graph representations of streaming applications, while ensuring the equitable distribution of computation-intensive tasks across groups. (4) Ms-Stream deploys task groups to compute nodes with varying resource capacities following the descending maximum padding priority rule for a balanced workload. Performance metrics such as system throughput and latency are evaluated with real-world streaming applications. Experimental results demonstrate the significant improvements made by Ms-Stream, which reduces maximum system latency by 61% and increases maximum throughput by more than 2x compared to existing state-of-the-art works.
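The abstract does not define the "descending maximum padding priority rule", so the following is only an illustrative sketch of one plausible reading, not the authors' algorithm. It assumes that "descending" means task groups are considered from heaviest to lightest, and that "maximum padding" means each group goes to the compute node with the most remaining capacity. The function name `deploy_groups`, the load and capacity units, and the example values are all hypothetical.

```python
from typing import Dict, List


def deploy_groups(group_loads: Dict[str, float],
                  node_capacities: Dict[str, float]) -> Dict[str, List[str]]:
    """Greedy placement sketch (an assumed reading of the rule, not Ms-Stream itself).

    Task groups are visited in descending order of load, and each one is
    assigned to the node that currently has the largest remaining capacity.
    """
    remaining = dict(node_capacities)  # headroom left on each node
    placement: Dict[str, List[str]] = {node: [] for node in node_capacities}

    # Heaviest group first; ties are broken by name so the result is deterministic.
    ordered = sorted(group_loads.items(), key=lambda kv: (-kv[1], kv[0]))

    for group, load in ordered:
        # Pick the node with the most headroom; ties are broken by node name.
        target = min(remaining, key=lambda n: (-remaining[n], n))
        placement[target].append(group)
        remaining[target] -= load  # may go negative if the node is overloaded

    return placement


if __name__ == "__main__":
    # Hypothetical loads and capacities in arbitrary units.
    groups = {"g1": 5, "g2": 4, "g3": 3, "g4": 2}
    nodes = {"n1": 8, "n2": 6}
    print(deploy_groups(groups, nodes))
```

Placing the heaviest groups first leaves the small groups to fill in whatever headroom remains, which is the usual reason greedy balancing heuristics sort in descending order. This sketch ignores communication cost between groups, which the paper handles in its separate subgraph partitioning step.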
About the journal:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.