大数据处理系统中无先验信息的任务调度

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) Pub Date : 2017-06-05 DOI:10.1109/ICDCS.2017.105

Zhiming Hu, Baochun Li, Zheng Qin, Rick Siow Mong Goh

{"title":"大数据处理系统中无先验信息的任务调度","authors":"Zhiming Hu, Baochun Li, Zheng Qin, Rick Siow Mong Goh","doi":"10.1109/ICDCS.2017.105","DOIUrl":null,"url":null,"abstract":"Job scheduling plays an important role in improving the overall system performance in big data processing frameworks. Simple job scheduling policies, such as Fair and FIFO scheduling, do not consider job sizes and may degrade the performance when jobs of varying sizes arrive. More elaborate job scheduling policies make the convenient assumption that jobs are recurring, and complete information about their sizes is available from their prior runs. In this paper, we design and implement an efficient and practical job scheduler for big data processing systems to achieve better performance even without prior information about job sizes. The superior performance of our job scheduler originates from the design of multiple level priority queues, where jobs are demoted to lower priority queues if the amount of service consumed so far reaches a certain threshold. In this case, jobs in need of a small amount of service can finish in the topmost several levels of queues, while jobs that need a large amount of service to complete are moved to lower priority queues to avoid head-of-line blocking. Our new job scheduler can effectively mimic the shortest job first scheduling policy without knowing the job sizes in advance. To demonstrate its performance, we have implemented our new job scheduler in YARN, a popular resource manager used by Hadoop/Spark, and validated its performance with both experiments on real datasets and large-scale trace-driven simulations. Our experimental and simulation results have strongly confirmed the effectiveness of our design: our new job scheduler can reduce the average job response time of the Fair scheduler by up to 45%.","PeriodicalId":127689,"journal":{"name":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"Job Scheduling without Prior Information in Big Data Processing Systems\",\"authors\":\"Zhiming Hu, Baochun Li, Zheng Qin, Rick Siow Mong Goh\",\"doi\":\"10.1109/ICDCS.2017.105\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Job scheduling plays an important role in improving the overall system performance in big data processing frameworks. Simple job scheduling policies, such as Fair and FIFO scheduling, do not consider job sizes and may degrade the performance when jobs of varying sizes arrive. More elaborate job scheduling policies make the convenient assumption that jobs are recurring, and complete information about their sizes is available from their prior runs. In this paper, we design and implement an efficient and practical job scheduler for big data processing systems to achieve better performance even without prior information about job sizes. The superior performance of our job scheduler originates from the design of multiple level priority queues, where jobs are demoted to lower priority queues if the amount of service consumed so far reaches a certain threshold. In this case, jobs in need of a small amount of service can finish in the topmost several levels of queues, while jobs that need a large amount of service to complete are moved to lower priority queues to avoid head-of-line blocking. Our new job scheduler can effectively mimic the shortest job first scheduling policy without knowing the job sizes in advance. To demonstrate its performance, we have implemented our new job scheduler in YARN, a popular resource manager used by Hadoop/Spark, and validated its performance with both experiments on real datasets and large-scale trace-driven simulations. Our experimental and simulation results have strongly confirmed the effectiveness of our design: our new job scheduler can reduce the average job response time of the Fair scheduler by up to 45%.\",\"PeriodicalId\":127689,\"journal\":{\"name\":\"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDCS.2017.105\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDCS.2017.105","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 19

摘要

在大数据处理框架中，作业调度在提高系统整体性能方面发挥着重要作用。简单的作业调度策略（如公平调度和先进先出调度）不考虑作业大小，当不同大小的作业到达时可能会降低性能。更复杂的作业调度策略则方便地假设作业是重复出现的，并且可以从作业之前的运行中获得有关作业大小的完整信息。在本文中，我们为大数据处理系统设计并实现了一种高效实用的作业调度程序，即使在没有作业规模相关信息的情况下也能获得更好的性能。我们的作业调度程序的优越性能源于多级优先队列的设计，在多级优先队列中，如果迄今为止所消耗的服务量达到一定阈值，作业就会被降级到较低优先级队列。在这种情况下，需要少量服务的作业可以在最上层的几级队列中完成，而需要大量服务才能完成的作业则会被移到较低优先级的队列中，以避免头部阻塞。我们的新作业调度器可以有效地模仿最短作业优先调度策略，而无需提前知道作业大小。为了证明新作业调度程序的性能，我们在 Hadoop/Spark 使用的流行资源管理器 YARN 中实施了新作业调度程序，并通过真实数据集实验和大规模跟踪仿真验证了其性能。我们的实验和仿真结果有力地证实了我们设计的有效性：我们的新作业调度程序可以将公平调度程序的平均作业响应时间最多缩短 45%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Job Scheduling without Prior Information in Big Data Processing Systems

Job scheduling plays an important role in improving the overall system performance in big data processing frameworks. Simple job scheduling policies, such as Fair and FIFO scheduling, do not consider job sizes and may degrade the performance when jobs of varying sizes arrive. More elaborate job scheduling policies make the convenient assumption that jobs are recurring, and complete information about their sizes is available from their prior runs. In this paper, we design and implement an efficient and practical job scheduler for big data processing systems to achieve better performance even without prior information about job sizes. The superior performance of our job scheduler originates from the design of multiple level priority queues, where jobs are demoted to lower priority queues if the amount of service consumed so far reaches a certain threshold. In this case, jobs in need of a small amount of service can finish in the topmost several levels of queues, while jobs that need a large amount of service to complete are moved to lower priority queues to avoid head-of-line blocking. Our new job scheduler can effectively mimic the shortest job first scheduling policy without knowing the job sizes in advance. To demonstrate its performance, we have implemented our new job scheduler in YARN, a popular resource manager used by Hadoop/Spark, and validated its performance with both experiments on real datasets and large-scale trace-driven simulations. Our experimental and simulation results have strongly confirmed the effectiveness of our design: our new job scheduler can reduce the average job response time of the Fair scheduler by up to 45%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)

自引率

0.00%

发文量