Authors: Haizhou Du, Keke Zhang, Zhenchen Yang
Published in: 2018 5th International Conference on Systems and Informatics (ICSAI)
Publication date: 2018-11-01
DOI: https://doi.org/10.1109/ICSAI.2018.8599476
Octopus: Based on Congestion-aware Scheduling on Geo-distributed Big Data Analytics Cluster
In recent years, big data analytics frameworks have sprung up rapidly. Meanwhile, it has become routine for large volumes of data to be generated, stored, and processed across geographically distributed data centers. Network congestion caused by data transfers between networks becomes a major bottleneck to overall system performance in a geo-distributed environment. Many existing methods handle network congestion only after it occurs, which does not solve the problem fundamentally. In this paper, we focus on predicting and avoiding network congestion in advance in a geo-distributed environment on Apache Spark, with the goal of reducing job completion times. We formulate this as a runtime minimization problem, which is challenging to solve in practice because the setting spans multiple data centers. To address these challenges, we propose a model based on congestion-aware scheduling. In the model, we exploit SDN (Software-Defined Networking) to detect, in advance, the data size of the flows coming from different data centers, and then analyze the data characteristics to predict which flows are likely to cause network congestion, so that we can devise two schemes for the different kinds of flows. In addition, when we detect network congestion, we choose a path with greater bandwidth for the congested flow. The approach can minimize network congestion, increase network utilization, and improve system performance in a geo-distributed environment. As a highlight of this paper, we design and implement our proposed solution as a job scheduler based on Apache Spark, a modern data processing framework.
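The abstract does not specify the prediction rule, the thresholds, or the SDN interface, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' scheduler. It assumes each flow's size is known in advance, flags a flow as congestion-prone when its estimated transfer time on the default path exceeds a budget, and then picks the candidate path with the most available bandwidth. The names `Flow`, `Path`, `is_congestion_prone`, `choose_path`, and the `max_transfer_s` budget are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    flow_id: str
    size_mb: float  # data size known in advance (assumed to come from the controller)


@dataclass
class Path:
    name: str
    available_mbps: float  # currently available bandwidth (assumed measurement)


def is_congestion_prone(flow: Flow, path: Path, max_transfer_s: float = 10.0) -> bool:
    """Flag a flow whose estimated transfer time on `path` exceeds the budget.

    The time-budget rule is an illustrative assumption, not the paper's predictor.
    """
    if path.available_mbps <= 0:
        return True  # no usable bandwidth: treat as congested
    estimated_s = flow.size_mb * 8 / path.available_mbps  # megabytes -> megabits
    return estimated_s > max_transfer_s


def choose_path(flow: Flow, default: Path, alternatives: list[Path],
                max_transfer_s: float = 10.0) -> Path:
    """Keep the default path for harmless flows; otherwise take the widest path."""
    if not is_congestion_prone(flow, default, max_transfer_s):
        return default
    return max([default, *alternatives], key=lambda p: p.available_mbps)
```

A short usage example, with made-up sizes and bandwidths:

```python
default = Path("dc1-dc2-direct", available_mbps=1000)
alternatives = [Path("dc1-dc3-dc2", 4000), Path("dc1-dc4-dc2", 2000)]

small = Flow("shuffle-1", size_mb=100)
large = Flow("shuffle-2", size_mb=5000)

print(choose_path(small, default, alternatives).name)
print(choose_path(large, default, alternatives).name)
```

The small flow stays on the default path, while the large flow is moved to the path with the greatest available bandwidth, mirroring the rerouting step described above.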