Octopus: Based on Congestion-aware Scheduling on Geo-distributed Big Data Analytics Cluster

2018 5th International Conference on Systems and Informatics (ICSAI). Pub Date: 2018-11-01. DOI: 10.1109/ICSAI.2018.8599476

Haizhou Du, Keke Zhang, Zhenchen Yang


Citations: 0

Abstract


Octopus: Based on Congestion-aware Scheduling on Geo-distributed Big Data Analytics Cluster

In recent years, big data analytics frameworks have sprung up rapidly. Meanwhile, it has become routine for large volumes of data to be generated, stored, and processed across geographically distributed data centers. In such a geo-distributed environment, network congestion caused by data transfers between networks becomes a major bottleneck for overall system performance. Many existing methods handle network congestion only after it occurs, which does not solve the problem fundamentally. In this paper, we focus on predicting and avoiding network congestion in advance in a geo-distributed environment on Apache Spark, measured in terms of job completion times. We formulate this as a runtime minimization problem, which is challenging to solve in practice because the setting involves different data centers. To address these challenges, we propose a model based on congestion-aware scheduling. In the model, we exploit SDN (Software-Defined Networking) to detect, in advance, the size of the data flows coming from different data centers and then analyze the data characteristics. This lets us predict which flows are likely to cause network congestion, so that we can draft two schemes for different flows. In addition, when we detect network congestion, we choose a path with greater bandwidth for the congested flow. The approach can minimize network congestion, increase network utilization, and improve system performance in a geo-distributed environment. As a highlight of this paper, we design and implement our proposed solution as a job scheduler based on Apache Spark, a modern data processing framework.
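The abstract does not give the prediction rule or the path-selection procedure, so the following is only a minimal Python sketch of the general idea it describes: use flow sizes known in advance to flag flows likely to congest their default path, and move those flows to the candidate path with the greatest bandwidth. All names here (`Flow`, `bottleneck_bandwidth`, `is_congestion_prone`, `choose_path`), the `deadline_s` threshold, and the choice of "slowest link" as a path's bandwidth are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    flow_id: str
    size_bytes: int  # assumed to be reported before the transfer starts


def bottleneck_bandwidth(path, link_bandwidth):
    """Bandwidth of a path in bits/s, taken as that of its slowest link."""
    return min(link_bandwidth[link] for link in path)


def is_congestion_prone(flow, path, link_bandwidth, deadline_s):
    """Hypothetical rule: flag a flow whose estimated transfer time
    on `path` exceeds `deadline_s` seconds."""
    transfer_time_s = flow.size_bytes * 8 / bottleneck_bandwidth(path, link_bandwidth)
    return transfer_time_s > deadline_s


def choose_path(flow, default_path, candidate_paths, link_bandwidth, deadline_s):
    """Keep the default path for harmless flows; otherwise pick the
    candidate path with the greatest bottleneck bandwidth."""
    if not is_congestion_prone(flow, default_path, link_bandwidth, deadline_s):
        return default_path
    return max(candidate_paths, key=lambda p: bottleneck_bandwidth(p, link_bandwidth))
```

In this sketch a path is a tuple of link names and `link_bandwidth` maps each link name to its available bandwidth in bits per second; a real SDN controller would supply both, and the two flow-handling schemes mentioned in the abstract may differ from this simple threshold split.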

