Deep Reinforcement Learning Based Dynamic Flowlet Switching for DCN
Pub Date : 2024-03-27 DOI: 10.1109/TCC.2024.3382132
Xinglong Diao;Huaxi Gu;Wenting Wei;Guoyong Jiang;Baochun Li
Flowlet switching has been proven to be an effective technology for fine-grained load balancing in data center networks. However, flowlet detection based on static flowlet timeout values lacks accuracy and effectiveness in complex network environments. In this article, we propose a new deep reinforcement learning approach, called DRLet, to dynamically detect flowlets. DRLet offers two advantages: first, it provides dynamic flowlet timeout values to split bursts into fine-grained flowlets; second, flowlet timeout values are automatically configured by the deep reinforcement learning agent, which requires only simple, measurable network states rather than any prior knowledge to achieve the pre-defined goal. With our approach, the flowlet timeout value dynamically matches the network load scenario, ensuring the accuracy and effectiveness of flowlet detection while suppressing packet reordering. Our results show that DRLet achieves superior performance compared to existing schemes based on static flowlet timeout values in both baseline and asymmetric topologies.
{"title":"Deep Reinforcement Learning Based Dynamic Flowlet Switching for DCN","authors":"Xinglong Diao;Huaxi Gu;Wenting Wei;Guoyong Jiang;Baochun Li","doi":"10.1109/TCC.2024.3382132","DOIUrl":"10.1109/TCC.2024.3382132","url":null,"abstract":"Flowlet switching has been proven to be an effective technology for fine-grained load balancing in data center networks. However, flowlet detection based on static flowlet timeout values, lacks accuracy and effectiveness in complex network environments. In this article, we propose a new deep reinforcement learning approach, called DRLet, to dynamically detect flowlets. DRLet offers two advantages: first, it provides dynamic flowlet timeout values to detect bursts into fine-grained flowlets; second, flowlet timeout values are automatically configured by the deep reinforcement learning agent, which only requires simple and measurable network states, instead of any prior knowledge, to achieve the pre-defined goal. With our approach, the flowlet timeout value dynamically matches the network load scenario, ensuring the accuracy and effectiveness of flowlet detection while suppressing packet reordering. Our results show that DRLet achieves superior performance compared to existing schemes based on static flowlet timeout values in both baseline and asymmetric topologies.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Task Offloading in Edge Computing Based on Dependency-Aware Reinforcement Learning
Pub Date : 2024-03-27 DOI: 10.1109/TCC.2024.3381646
Xiangchun Chen;Jiannong Cao;Yuvraj Sahni;Shan Jiang;Zhixuan Liang
Collaborative edge computing (CEC) is an emerging computing paradigm in which edge nodes collaborate to perform tasks from end devices. Task offloading decides when and at which edge node tasks are executed. Most existing studies assume task profiles and network conditions are known in advance, and thus can hardly adapt to dynamic real-world computation environments. Some learning-based methods perform online task offloading without considering task dependency and network flow scheduling, leading to underutilized resources and flow congestion. We study Online Dependent Task Offloading (ODTO) in CEC, jointly optimizing task offloading and network flow scheduling to improve quality of service by reducing task completion time and energy consumption. The challenge of ODTO lies in how to offload dependent tasks and schedule network flows in dynamic networks. We model ODTO as a Markov Decision Process (MDP) and propose an Asynchronous Deep Progressive Reinforcement Learning (ADPRL) approach that optimizes offloading and bandwidth decisions. We design a novel dependency-aware reward mechanism to address task dependency and network dynamics. Extensive experiments on the Alibaba cluster trace dataset and a synthetic dataset indicate that our algorithm outperforms heuristic and learning-based methods in average task completion time and energy consumption.
{"title":"Dynamic Task Offloading in Edge Computing Based on Dependency-Aware Reinforcement Learning","authors":"Xiangchun Chen;Jiannong Cao;Yuvraj Sahni;Shan Jiang;Zhixuan Liang","doi":"10.1109/TCC.2024.3381646","DOIUrl":"10.1109/TCC.2024.3381646","url":null,"abstract":"Collaborative edge computing (CEC) is an emerging computing paradigm in which edge nodes collaborate to perform tasks from end devices. Task offloading decides when and at which edge node tasks are executed. Most existing studies assume task profiles and network conditions are known in advance, which can hardly adapt to dynamic real-world computation environments. Some learning-based methods use online task offloading without considering task dependency and network flow scheduling, leading to underutilized resources and flow congestion. We study Online Dependent Task Offloading (ODTO) in CEC, jointly optimizing network flow scheduling to optimize quality of service by reducing task completion time and energy consumption. The challenge of ODTO lies in how to offload dependent tasks and schedule network flows in dynamic networks. We model ODTO as the Markov Decision Process (MDP) and propose an Asynchronous Deep Progressive Reinforcement Learning (ADPRL) approach that optimize offloading and bandwidth decisions. We design a novel dependency-aware reward mechanism to address task dependency and dynamic network. Extensive experiments on the Alibaba cluster trace dataset and synthetic dataset indicate that our algorithm outperforms heuristic and learning-based methods in average task completion time and energy consumption.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Makespan and Security-Aware Workflow Scheduling for Cloud Service Cost Minimization
Pub Date : 2024-03-27 DOI: 10.1109/TCC.2024.3382351
Liying Li;Chengliang Zhou;Peijin Cong;Yufan Shen;Junlong Zhou;Tongquan Wei
The market penetration of Infrastructure-as-a-Service (IaaS) in cloud computing is increasing, benefiting from its flexibility and scalability. One of the most important issues for IaaS cloud service providers is to minimize monetary cost while meeting cloud user experience requirements such as makespan and security. Prior works on cloud service cost minimization ignore either security or makespan, both of which are important for user experience. In this article, we propose a two-stage algorithm that solves the cloud service cost minimization problem while satisfying the security and makespan requirements of cloud users. Specifically, in the first stage, we propose a novel security service selection scheme that ensures system security by judiciously selecting low-cost security services for tasks under time and security constraints. In the second stage, to further reduce the cloud service cost, we design a workflow scheduling method based on an improved firefly algorithm (IFA). The IFA-based method schedules cloud service workflows onto low-cost virtual machines while guaranteeing security and makespan, and it quickly finds a cost-minimized workflow scheduling solution using our designed updating scheme and mapping operator. Extensive simulations are conducted on real-world workflows to verify the efficacy of the proposed two-stage method. Simulation results show that the proposed method outperforms the baseline and two benchmarking methods in terms of cost minimization without violating security and time constraints. Compared to the benchmarking methods, the cloud service cost can be reduced by up to 57.6% using our approach.
{"title":"Makespan and Security-Aware Workflow Scheduling for Cloud Service Cost Minimization","authors":"Liying Li;Chengliang Zhou;Peijin Cong;Yufan Shen;Junlong Zhou;Tongquan Wei","doi":"10.1109/TCC.2024.3382351","DOIUrl":"10.1109/TCC.2024.3382351","url":null,"abstract":"The market penetration of Infrastructure-as-a-Service (IaaS) in cloud computing is increasing benefiting from its flexibility and scalability. One of the most important issues for IaaS cloud service providers is to minimize the monetary cost while meeting cloud user experience requirements such as makespan and security. Prior works on cloud service cost minimization ignore either security or makespan which is very important for user experience. In this article, we propose a two-stage algorithm to solve the cloud service cost minimization problem at the premise of satisfying the security and makespan requirements of cloud users. Specifically, in the first stage, we propose a novel security service selection scheme to ensure system security by judiciously selecting security services with low cost for tasks under the constraints of time and security. In the second stage, to further reduce the cloud service cost, we design a workflow scheduling method based on an improved firefly algorithm (IFA). The IFA-based method schedules cloud service workflows to virtual machines of small cost at the premise of guaranteeing security and makespan. It can quickly find the workflow scheduling solution with minimized cost using our designed updating scheme and mapping operator. Extensive simulations are conducted on real-world workflows to verify the efficacy of the proposed two-stage method. Simulation results show that the proposed two-stage method outperforms the baseline and two benchmarking methods in terms of cost minimization without violating security and time constraints. Compared to benchmarking methods, the cloud service cost can be reduced by up to 57.6% by using our proposed approach.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Live Migration of Virtual Machines Based on Dirty Page Similarity
Pub Date : 2024-03-20 DOI: 10.1109/TCC.2024.3379494
Yucong Chen;Shuaixin Xu;Hubin Yang;Rui Zhou;Deke Guo;Qingguo Zhou
Pre-copy-based Virtual Machine (VM) live migration seamlessly migrates a running VM to the target physical server by pre-copying memory pages and propagating updates through iterative copy rounds. This method offers high reliability and robustness, can effectively achieve load balancing and reduce energy consumption, and is widely used in industry to manage server cluster resources. However, it also suffers from problems such as repeated transmission of dirty memory pages and failure of the iterative transmission to converge, so pre-copy live migration cannot efficiently allocate server cluster resources. To resolve these problems, this paper proposes a VM pre-copy live migration technique based on the similarity of dirty memory pages. The access priority of historical dirty memory pages is determined by calculating a similarity weight based on the Hamming distance. A priority-based delayed transmission scheme for frequently and rarely dirtied pages is used to decrease the repeated transmission of frequently dirtied memory pages, increase the convergence speed of the iterative copy process, and reduce the overall migration time of VMs. A comparative analysis of experimental results across six dimensions shows that the proposed method achieves better migration efficiency than the conventional live migration strategy.
{"title":"Live Migration of Virtual Machines Based on Dirty Page Similarity","authors":"Yucong Chen;Shuaixin Xu;Hubin Yang;Rui Zhou;Deke Guo;Qingguo Zhou","doi":"10.1109/TCC.2024.3379494","DOIUrl":"10.1109/TCC.2024.3379494","url":null,"abstract":"Pre-copy-based Virtual Machine (VM) live migration seamlessly migrates the running VM to the target physical server by pre-copying memory pages and realizing updates through loop iterations. This method, which has high reliability and robustness, can effectively achieve load balancing and reduce energy consumption. It is widely used in the industry to manage server cluster resources. However, it also involves many problems, such as many dirty memory pages resulting from repeated transmission and convergence failure of iterative transmission. Hence, pre-copy live migration cannot efficiently allocate server cluster resources. To resolve these problems, a VM pre-copy live migration technology based on the similarity of dirty memory pages is proposed in this paper. The access priority of historical dirty memory pages was determined by calculating the similarity weight based on the Hamming distance. A priority-based delay transmission scheme for high dirty pages and low dirty pages was used to decrease the frequent transmission of high dirty memory pages, increase the convergence speed of the live-migration iterative copy process, and reduce the overall migration time of VMs. A comparative analysis of experimental results based on six dimensions showed that the proposed method achieved better migration efficiency than the conventional live migration strategy.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":null,"pages":null},"PeriodicalIF":6.5,"publicationDate":"2024-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140200145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-20 DOI: 10.1109/TCC.2024.3403175
Myoungsung You;Minjae Seo;Jaehan Kim;Seungwon Shin;Jaehyun Nam
Containers have become the predominant virtualization technique for deploying microservices in cloud environments. However, container networking, which is critical for microservice functionality, often introduces significant overhead and resource consumption, potentially degrading microservice performance. This challenge stems from the complexity of the software-based network data plane, which is responsible for network virtualization and access control over container traffic. To tackle this challenge, we propose Hyperion