{"title":"SpotDAG:基于 RL 的异构云环境中 DAG 工作流调度算法","authors":"Liduo Lin;Li Pan;Shijun Liu","doi":"10.1109/TSC.2024.3422828","DOIUrl":null,"url":null,"abstract":"As increasingly complex functions are implemented in applications, directed acyclic graphs (DAGs) are widely used to model the inter-dependencies between individual functions. Cloud-based data processing platforms need to consider the complex topology of DAGs and arbitrary deadlines given by users for job scheduling, leading to an NP-hard decision-making problem. Leveraging spot instances in data processing platforms can achieve significant cost savings, but the unpredictable interruption of spot instances makes the problem of VM scaling and job scheduling more difficult. In this paper, a Reinforcement Learning (RL) based approach called SpotDAG is proposed to solve the auto-scaling problem for jobs modeled as DAGs on a data processing platform where spot instances are introduced. SpotDAG makes cluster scaling and job scheduling decisions at the same time by mapping its output to several meta-policies. This paper introduces the self-attention mechanism for feature extraction to help the intelligent agent learn faster. A mask layer after the output of the proposed RL-based algorithm circumvents illegal actions to ensure that a job is completed by its deadline. Extensive experimental results show that the proposed approach can significantly reduce the cost of instances for data processing platforms while ensuring that jobs are completed in time.","PeriodicalId":13255,"journal":{"name":"IEEE Transactions on Services Computing","volume":"17 5","pages":"2904-2917"},"PeriodicalIF":5.8000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SpotDAG: An RL-Based Algorithm for DAG Workflow Scheduling in Heterogeneous Cloud Environments\",\"authors\":\"Liduo Lin;Li Pan;Shijun Liu\",\"doi\":\"10.1109/TSC.2024.3422828\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As increasingly complex functions are implemented in applications, directed acyclic graphs (DAGs) are widely used to model the inter-dependencies between individual functions. Cloud-based data processing platforms need to consider the complex topology of DAGs and arbitrary deadlines given by users for job scheduling, leading to an NP-hard decision-making problem. Leveraging spot instances in data processing platforms can achieve significant cost savings, but the unpredictable interruption of spot instances makes the problem of VM scaling and job scheduling more difficult. In this paper, a Reinforcement Learning (RL) based approach called SpotDAG is proposed to solve the auto-scaling problem for jobs modeled as DAGs on a data processing platform where spot instances are introduced. SpotDAG makes cluster scaling and job scheduling decisions at the same time by mapping its output to several meta-policies. This paper introduces the self-attention mechanism for feature extraction to help the intelligent agent learn faster. A mask layer after the output of the proposed RL-based algorithm circumvents illegal actions to ensure that a job is completed by its deadline. Extensive experimental results show that the proposed approach can significantly reduce the cost of instances for data processing platforms while ensuring that jobs are completed in time.\",\"PeriodicalId\":13255,\"journal\":{\"name\":\"IEEE Transactions on Services Computing\",\"volume\":\"17 5\",\"pages\":\"2904-2917\"},\"PeriodicalIF\":5.8000,\"publicationDate\":\"2024-07-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Services Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10584150/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Services Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10584150/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
随着应用中实现的功能越来越复杂,有向无环图(DAG)被广泛用于模拟各个功能之间的相互依赖关系。基于云的数据处理平台需要考虑 DAG 的复杂拓扑结构和用户为作业调度指定的任意截止日期,这就导致了一个 NP 难决策问题。在数据处理平台中利用点实例可以大大节约成本,但点实例不可预测的中断性使得虚拟机扩展和作业调度问题变得更加困难。本文提出了一种名为 SpotDAG 的基于强化学习(RL)的方法,用于解决数据处理平台上以 DAG 为模型的作业自动缩放问题。SpotDAG 通过将其输出映射到多个元策略,同时做出集群扩展和作业调度决策。本文介绍了用于特征提取的自我注意机制,以帮助智能代理更快地学习。在基于 RL 的算法输出后有一个掩码层,可规避非法操作,确保作业在截止日期前完成。广泛的实验结果表明,所提出的方法可以显著降低数据处理平台的实例成本,同时确保作业按时完成。
SpotDAG: An RL-Based Algorithm for DAG Workflow Scheduling in Heterogeneous Cloud Environments
As increasingly complex functions are implemented in applications, directed acyclic graphs (DAGs) are widely used to model the inter-dependencies between individual functions. Cloud-based data processing platforms need to consider the complex topology of DAGs and arbitrary deadlines given by users for job scheduling, leading to an NP-hard decision-making problem. Leveraging spot instances in data processing platforms can achieve significant cost savings, but the unpredictable interruption of spot instances makes the problem of VM scaling and job scheduling more difficult. In this paper, a Reinforcement Learning (RL) based approach called SpotDAG is proposed to solve the auto-scaling problem for jobs modeled as DAGs on a data processing platform where spot instances are introduced. SpotDAG makes cluster scaling and job scheduling decisions at the same time by mapping its output to several meta-policies. This paper introduces the self-attention mechanism for feature extraction to help the intelligent agent learn faster. A mask layer after the output of the proposed RL-based algorithm circumvents illegal actions to ensure that a job is completed by its deadline. Extensive experimental results show that the proposed approach can significantly reduce the cost of instances for data processing platforms while ensuring that jobs are completed in time.
期刊介绍:
IEEE Transactions on Services Computing encompasses the computing and software aspects of the science and technology of services innovation research and development. It places emphasis on algorithmic, mathematical, statistical, and computational methods central to services computing. Topics covered include Service Oriented Architecture, Web Services, Business Process Integration, Solution Performance Management, and Services Operations and Management. The transactions address mathematical foundations, security, privacy, agreement, contract, discovery, negotiation, collaboration, and quality of service for web services. It also covers areas like composite web service creation, business and scientific applications, standards, utility models, business process modeling, integration, collaboration, and more in the realm of Services Computing.