
Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) - Latest Publications

ServerMore
Amoghavarsha Suresh, Anshul Gandhi
Serverless computing allows customers to submit their jobs to the cloud for execution, with resource provisioning handled by the cloud provider. Serverless functions are often short-lived and have modest resource requirements, presenting an opportunity to improve server utilization by colocating them with latency-sensitive customer workloads. This paper presents ServerMore, a server-level resource manager that opportunistically colocates customer serverless jobs with serverful customer VMs. ServerMore dynamically regulates the CPU, memory bandwidth, and LLC resources on the server to ensure that the colocation between serverful and serverless workloads does not impact application tail latencies. By selectively admitting serverless functions and inferring the performance of black-box serverful workloads, ServerMore improves resource utilization on average by 35.9% to 245% compared to prior works, while having a minimal impact on the latency of both serverful applications and serverless functions.
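To make the admission idea concrete, here is a minimal Python sketch of a latency-aware admission controller in the spirit of ServerMore. Everything here is an assumption for illustration: the class and method names, the 0.9 headroom factor, and the single p99 signal. The real system additionally regulates memory bandwidth and LLC allocations, which this sketch omits.

```python
from collections import deque

class AdmissionController:
    """Hypothetical sketch: admit serverless functions onto a server only
    while the inferred tail latency of the black-box VM workload stays
    within its target, and back off when it does not."""

    def __init__(self, latency_slo_ms: float, window: int = 100):
        self.latency_slo_ms = latency_slo_ms
        self.samples = deque(maxlen=window)   # recent VM latency samples
        self.serverless_quota = 1             # cores currently lent to functions

    def record_vm_latency(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)

    def p99(self) -> float:
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

    def admit(self, fn_cores_needed: int, idle_cores: int) -> bool:
        """Admit a function only if the VM tail latency has headroom and
        idle cores can cover the request."""
        if self.p99() > 0.9 * self.latency_slo_ms:
            self.serverless_quota = max(0, self.serverless_quota - 1)  # back off
            return False
        if fn_cores_needed <= min(idle_cores, self.serverless_quota):
            self.serverless_quota += 1   # cautiously grow the lent share
            return True
        return False
```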
{"title":"ServerMore","authors":"Amoghavarsha Suresh, Anshul Gandhi","doi":"10.1145/3472883.3486979","DOIUrl":"https://doi.org/10.1145/3472883.3486979","url":null,"abstract":"Serverless computing allows customers to submit their jobs to the cloud for execution, with the resource provisioning being taken care of by the cloud provider. Serverless functions are often short-lived and have modest resource requirements, thereby presenting an opportunity to improve server utilization by colocating with latency-sensitive customer workloads. This paper presents ServerMore, a server-level resource manager that opportunistically colocates customer serverless jobs with serverful customer VMs. ServerMore dynamically regulates the CPU, memory bandwidth, and LLC resources on the server to ensure that the colocation between serverful and serverless workloads does not impact application tail latencies. By selectively admitting serverless functions and inferring the performance of black-box serverful workloads, ServerMore improves resource utilization on average by 35.9% to 245% compared to prior works; while having a minimal impact on the latency of both serverful applications and serverless functions.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83098300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Portkey
Joseph Noor, M. Srivastava, R. Netravali
Owing to a need for low-latency data accesses, emerging IoT and mobile applications commonly require distributed data stores (e.g., key-value or KV stores) to operate entirely at the network's edge. Unfortunately, existing KV stores employ randomized data placement policies (e.g., consistent hashing) that ignore the client mobility and resulting variance in client-server latencies that are inherent to edge applications---the result is largely suboptimal and inefficient data placement. We present Portkey, a distributed KV store that dynamically adapts data placement according to time-varying client mobility and data access patterns. The key insight with Portkey is to lean into the inherent mobility and prioritize rapid but approximate placement decisions over delayed optimal ones. Doing so enables the efficient tracking of client-server latencies despite edge resource constraints, and the use of greedy placement heuristics that are self-correcting over short timescales. Results with a realistic autonomous vehicle dataset and two small-scale deployments reveal that Portkey reduces average and tail request latency by 21-82% and 26-77%, respectively, compared to existing placement strategies.
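A minimal sketch of the greedy, self-correcting placement idea, with all names and the EWMA smoothing assumed for illustration: keep cheap per-(client, server) latency estimates and move each key to whichever edge server minimizes the weighted access cost of its recent clients, rather than solving for an optimal placement.

```python
class GreedyPlacer:
    """Illustrative sketch: rapid, approximate placement decisions that
    self-correct as new latency observations arrive."""

    def __init__(self, servers, alpha=0.2):
        self.servers = servers
        self.alpha = alpha
        self.latency = {}    # (client, server) -> EWMA of observed RTT in ms
        self.accesses = {}   # key -> {client: recent access count}

    def observe(self, client, server, rtt_ms):
        old = self.latency.get((client, server), rtt_ms)
        self.latency[(client, server)] = (1 - self.alpha) * old + self.alpha * rtt_ms

    def record_access(self, key, client):
        counts = self.accesses.setdefault(key, {})
        counts[client] = counts.get(client, 0) + 1

    def best_server(self, key):
        """Greedy choice: server minimizing frequency-weighted latency for
        the key's recent clients (unknown pairs count as infinitely far)."""
        clients = self.accesses.get(key, {})
        def cost(server):
            return sum(n * self.latency.get((c, server), float("inf"))
                       for c, n in clients.items())
        return min(self.servers, key=cost)
```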
{"title":"Portkey","authors":"Joseph Noor, M. Srivastava, R. Netravali","doi":"10.1145/3472883.3487004","DOIUrl":"https://doi.org/10.1145/3472883.3487004","url":null,"abstract":"Owing to a need for low latency data accesses, emerging IoT and mobile applications commonly require distributed data stores (e.g., key-value or KV stores) to operate entirely at the network's edge. Unfortunately, existing KV stores employ randomized data placement policies (e.g., consistent hashing) that ignore the client mobility and resulting variance in client-server latencies that are inherent to edge applications---the effect is largely suboptimal and inefficient data placement. We present Portkey, a distributed KV store that dynamically adapts data placement according to time-varying client mobility and data access patterns. The key insight with Portkey is to lean into the inherent mobility and prioritize rapid but approximate placement decisions over delayed optimal ones. Doing so enables the efficient tracking of client-server latencies despite edge resource constraints, and the use of greedy placement heuristics that are self-correcting over short timescales. Results with a realistic autonomous vehicle dataset and two small-scale deployments reveal that Portkey reduces average and tail request latency by 21-82% and 26-77% compared to existing placement strategies.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83672806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Tell me when you are sleepy and what may wake you up!
Djob Mvondo, A. Barbalace, A. Tchana, Gilles Muller
Nowadays, there is a shift in the deployment model of Cloud and Edge applications. Applications are now deployed as a set of several small units communicating with each other - the microservice model. Moreover, each unit - a microservice - may be implemented as a virtual machine, container, function, etc., spanning the different Cloud and Edge service models including IaaS, PaaS, and FaaS. A microservice is instantiated upon the reception of a request (e.g., an http packet or a trigger), and a rack-level or data-center-level scheduler decides the placement for such a unit of execution considering, for example, data locality and load balancing. With such a configuration, it is common to encounter scenarios where different units, as well as multiple instances of the same unit, may be running on a single server at the same time. When multiple microservices are running on the same server, not all of them are necessarily doing actual processing; some may be busy-waiting - i.e., waiting for events (or requests) sent by other units. However, these "idle" units are consuming CPU time which could be used by other running units or cloud utility functions on the server (e.g., monitoring daemons). In a controlled experiment, we observe that units can spend up to 20% - 55% of their CPU time waiting; thus a great amount of CPU time is wasted. These values grow significantly when overcommitting CPU resources (i.e., unit CPU reservations exceed server CPU capacity), where we observe up to 69% - 75%. This is a result of the lack of information/context about what is running in each unit from the server CPU scheduler's perspective. In this paper, we first provide evidence of the problem and discuss several research questions. Then, we propose a handful of solutions worth exploring, which consist of revisiting hypervisor and host OS scheduler designs to reduce the CPU time wasted on idle units. Our proposal leverages the concepts of informed scheduling and monitoring of internal and external events. Based on the aforementioned solutions, we present our initial implementation on Linux/KVM.
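A toy illustration of the core idea, with an entirely hypothetical interface: a unit declares the events it is blocked on, and the host-side scheduler gives CPU time only to units that are not idle-waiting, instead of letting them busy-wait.

```python
class SchedulerHints:
    """Hypothetical sketch of an informed-scheduling hint channel between
    units (VMs/containers/functions) and the host scheduler."""

    def __init__(self):
        self.waiting_on = {}   # unit id -> set of event names it awaits
        self.pending = set()   # events that have fired but not been consumed

    def sleep(self, unit, events):
        """Unit declares it is idle until one of `events` occurs."""
        self.waiting_on[unit] = set(events)

    def fire(self, event):
        """External or internal event arrives (e.g., a request)."""
        self.pending.add(event)

    def runnable(self, unit):
        """Host scheduler check: schedule a unit only if it is not waiting,
        or if at least one awaited event is pending."""
        awaited = self.waiting_on.get(unit)
        if awaited is None:
            return True
        if awaited & self.pending:
            self.waiting_on.pop(unit)   # wake up, drop the hint
            return True
        return False

hints = SchedulerHints()
hints.sleep("svc-a", {"http-request"})
print(hints.runnable("svc-a"))   # False: descheduled instead of busy-waiting
hints.fire("http-request")
print(hints.runnable("svc-a"))   # True: event arrived, unit runs again
```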
Citations: 3
OptDebug
Muhammad Ali Gulzar, Miryung Kim
Fault isolation is extremely challenging in large-scale data processing in cloud environments. Data provenance is a dominant existing approach to isolate data records responsible for a given output. However, data provenance concerns fault isolation only in the data-space, as opposed to fault isolation in the code-space---how can we precisely localize operations or APIs responsible for a given suspicious or incorrect result? We present OptDebug, which identifies fault-inducing operations in a dataflow application using three insights. First, debugging is easier with a small-scale input than a large-scale input, so OptDebug uses data provenance to reduce the original input records to a smaller set that still produces both test failures and test successes. Second, keeping track of operation provenance is crucial for debugging. Thus, it leverages automated taint analysis to propagate the lineage of operations downstream with individual records. Lastly, each operation may contribute to test failures to a different degree. Thus OptDebug ranks each operation's spectrum---its relative participation frequency in failing vs. passing tests. In our experiments, OptDebug achieves 100% recall and 86% precision in detecting faulty operations and reduces debugging time by 17x compared to a naïve approach. Overall, OptDebug shows great promise in improving developer productivity in today's complex data processing pipelines by obviating the need to re-execute the program repetitively with different inputs and manually examine program traces to isolate buggy code.
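The spectrum ranking can be illustrated with a short sketch. The paper's exact formula is not reproduced here; the Ochiai score, a standard choice in spectrum-based fault localization, stands in for it, and the input format is assumed.

```python
import math

def rank_operations(runs):
    """runs: list of (ops_executed: set[str], passed: bool) test outcomes.
    Returns operations sorted by Ochiai suspiciousness, highest first."""
    total_fail = sum(1 for _, passed in runs if not passed)
    ops = set().union(*(ops for ops, _ in runs))
    scores = {}
    for op in ops:
        ef = sum(1 for ops_, p in runs if op in ops_ and not p)  # executed & failed
        ep = sum(1 for ops_, p in runs if op in ops_ and p)      # executed & passed
        denom = math.sqrt(total_fail * (ef + ep))
        scores[op] = ef / denom if denom else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Example: "parse_date" appears in all failing runs and no passing runs,
# so it ranks first.
runs = [({"read", "parse_date"}, False),
        ({"read", "parse_date"}, False),
        ({"read", "filter"}, True)]
print(rank_operations(runs))
```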
{"title":"OptDebug","authors":"Muhammad Ali Gulzar, Miryung Kim","doi":"10.1145/3472883.3487016","DOIUrl":"https://doi.org/10.1145/3472883.3487016","url":null,"abstract":"Fault-isolation is extremely challenging in large scale data processing in cloud environments. Data provenance is a dominant existing approach to isolate data records responsible for a given output. However, data provenance concerns fault isolation only in the data-space, as opposed to fault isolation in the code-space---how can we precisely localize operations or APIs responsible for a given suspicious or incorrect result? We present OptDebug that identifies fault-inducing operations in a dataflow application using three insights. First, debugging is easier with a small-scale input than a large-scale input. So it uses data provenance to simplify the original input records to a smaller set leading to test failures and test successes. Second, keeping track of operation provenance is crucial for debugging. Thus, it leverages automated taint analysis to propagate the lineage of operations downstream with individual records. Lastly, each operation may contribute to test failures to a different degree. Thus OptDebug ranks each operation's spectra---the relative participation frequency in failing vs. passing tests. In our experiments, OptDebug achieves 100% recall and 86% precision in terms of detecting faulty operations and reduces the debugging time by 17x compared to a naïve approach. Overall, OptDebug shows great promise in improving developer productivity in today's complex data processing pipelines by obviating the need to re-execute the program repetitively with different inputs and manually examine program traces to isolate buggy code.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79374818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
OneEdge
Enrique Saurez, H. Gupta, Alexandros Daglis, U. Ramachandran
Resource management for geo-distributed infrastructures is challenging due to the scarcity and non-uniformity of edge resources, as well as the high client mobility and workload surges inherent to situation awareness applications. Due to their centralized nature, state-of-the-art schedulers that work well in datacenters cannot meet the performance and feature requirements of such applications. We present OneEdge, a hybrid control plane that enables autonomous decision-making at edge sites for localized, rapid single-site application deployment. Edge sites handle mobility, churn, and load spikes by cooperating with a centralized controller that allows coordinated multi-site scheduling and dynamic reconfiguration. OneEdge's scheduling decisions are driven by each application's end-to-end service level objective (E2E SLO) as well as the specific requirements of situation awareness applications. OneEdge's novel distributed state management combines autonomous decision-making at the edge sites for rapid localized resource allocations with decision-making at the central controller when multi-site application deployment is needed. Using a mix of applications on multi-region Azure instances, we show that, in contrast to centralized or fully distributed control planes, OneEdge caters to the unique requirements of situation awareness applications. Compared to a centralized control plane, OneEdge reduces deployment latency by 66% for single-site applications, without compromising E2E SLOs.
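A compact sketch of the hybrid decision path, with all data structures and thresholds assumed: the edge site deploys single-site applications autonomously when its local estimate meets the E2E SLO and capacity allows, and escalates to the central controller otherwise.

```python
def deploy(app, site, central_place):
    """app: dict with 'single_site' and 'e2e_slo_ms'; site: dict with a
    local latency estimate and free capacity; central_place: fallback
    callable standing in for the coordinated multi-site scheduler."""
    if (app["single_site"]
            and site["est_e2e_ms"] <= app["e2e_slo_ms"]
            and site["free_slots"] > 0):
        site["free_slots"] -= 1
        return f"placed at {site['name']} (autonomous fast path)"
    return central_place(app)   # mobility, churn, or multi-site needs

site = {"name": "edge-atl", "est_e2e_ms": 40, "free_slots": 2}
app = {"single_site": True, "e2e_slo_ms": 50}
print(deploy(app, site, lambda a: "escalated to central controller"))
```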
{"title":"OneEdge","authors":"Enrique Saurez, H. Gupta, Alexandros Daglis, U. Ramachandran","doi":"10.1145/3472883.3487008","DOIUrl":"https://doi.org/10.1145/3472883.3487008","url":null,"abstract":"Resource management for geo-distributed infrastructures is challenging due to the scarcity and non-uniformity of edge resources, as well as the high client mobility and workload surges inherent to situation awareness applications. Due to their centralized nature, state-of-the-art schedulers that work well in datacenters lack the performance and feature requirements of such applications. We present OneEdge, a hybrid control plane that enables autonomous decision-making at edge sites for localized, rapid single-site application deployment. Edge sites handle mobility, churn, and load spikes, by cooperating with a centralized controller that allows coordinated multi-site scheduling and dynamic reconfiguration. OneEdge's scheduling decisions are driven by each application's end-to-end service level objective (E2E SLO) as well as the specific requirements of situation awareness applications. OneEdge's novel distributed state management combines autonomous decision-making at the edge sites for rapid localized resource allocations with decision-making at the central controller when multi-site application deployment is needed. Using a mix of applications on multi-region Azure instances, we show that, in contrast to centralized or fully distributed control planes, OneEdge caters to the unique requirements of situation awareness applications. Compared to a centralized control plane, OneEdge reduces deployment latency by 66% for single-site applications, without compromising E2E SLOs.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"120 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72922991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
3MileBeach
Jun Zhang, Robert Ferydouni, Aldrin Montana, Daniel Bittman, P. Alvaro
We present 3MileBeach, a tracing and fault injection platform designed for microservice-based architectures. 3MileBeach interposes on the message serialization libraries that are ubiquitous in this environment, avoiding the application code instrumentation that tracing and fault injection infrastructures typically require. 3MileBeach provides message-level distributed tracing at less than 50% of the overhead of state-of-the-art tracing frameworks, and fault injection that allows higher-precision experiments than existing solutions. We measure the overhead of 3MileBeach as a tracer and its efficacy as a fault injector. We qualitatively measure its promise as a platform for tuning and debugging by sharing concrete use cases in the context of bottleneck identification, performance tuning, and bug finding. Finally, we use 3MileBeach to perform a novel type of fault injection - Temporal Fault Injection (TFI), which more precisely controls individual inter-service message flow with temporal prerequisites, and makes it possible to catch an entirely new class of fault tolerance bugs.
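A hedged sketch of serialization-layer interposition; the real system hooks message serialization libraries such as protobuf, while JSON and these function names are stand-ins. Each hop appends trace metadata at serialization time, and faults can be injected precisely at a chosen service's receive path, with no application-code changes.

```python
import json, time

# service name -> exception to raise on receipt; the paper's temporal
# rules can additionally condition on message ordering (not shown here).
FAULT_RULES = {}
# e.g.: FAULT_RULES["search"] = TimeoutError("injected at search")

def serialize(payload: dict, service: str, trace: list) -> bytes:
    """Wrap the payload and append this hop's trace record."""
    trace = trace + [{"service": service, "ts": time.time()}]
    return json.dumps({"payload": payload, "trace": trace}).encode()

def deserialize(raw: bytes, service: str):
    """Unwrap the payload; inject a fault at the message boundary if a
    rule targets this service."""
    msg = json.loads(raw.decode())
    if service in FAULT_RULES:
        raise FAULT_RULES[service]
    return msg["payload"], msg["trace"]

wire = serialize({"q": "find"}, "frontend", [])
payload, trace = deserialize(wire, "search")
print(trace)   # message-level trace accumulated across hops
```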
{"title":"3MileBeach","authors":"Jun Zhang, Robert Ferydouni, Aldrin Montana, Daniel Bittman, P. Alvaro","doi":"10.1145/3472883.3486986","DOIUrl":"https://doi.org/10.1145/3472883.3486986","url":null,"abstract":"We present 3MileBeach, a tracing and fault injection platform designed for microservice-based architectures. 3Mile-Beach interposes on the message serialization libraries that are ubiquitous in this environment, avoiding the application code instrumentation that tracing and fault injection infrastructures typically require. 3MileBeach provides message-level distributed tracing at less than 50% of the overhead of the state-of-the-art tracing frameworks, and fault injection that allows higher precision experiments than existing solutions. We measure the overhead of 3MileBeach as a tracer and its efficacy as a fault injector. We qualitatively measure its promise as a platform for tuning and debugging by sharing concrete use cases in the context of bottleneck identification, performance tuning, and bug finding. Finally, we use 3MileBeach to perform a novel type of fault injection - Temporal Fault Injection (TFI), which more precisely controls individual inter-service message flow with temporal prerequisites, and makes it possible to catch an entirely new class of fault tolerance bugs.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78364275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
George: Learning to Place Long-Lived Containers in Large Clusters with Operation Constraints
Suyi Li, Luping Wang, Wen Wang, Yinghao Yu, Bo Li
Online cloud services are widely deployed as Long-Running Applications (LRAs) hosted in containers. Placing LRA containers turns out to be particularly challenging due to the complex interference between co-located containers and the operation constraints in production clusters such as fault tolerance, disaster avoidance, and incremental deployment. Existing schedulers typically provide APIs for operators to manually specify container scheduling requirements and offer only qualitative scheduling guidelines for container placement. Such schedulers do not perform well in terms of either performance or scale, and also require manual intervention. In this work, we propose George, an end-to-end general-purpose LRA scheduler that leverages state-of-the-art Reinforcement Learning (RL) techniques to intelligently schedule LRA containers. We present an optimal container placement formulation for the first time, with the objective of maximizing container placement performance subject to a set of operation constraints. One fundamental challenge in scheduling is to categorically satisfy different operation constraints in practice; specifically, to guarantee hard constraints and keep soft-constraint violations within a pre-defined threshold. We design a novel projection-based proximal policy optimization (PPPO) algorithm in combination with an Integer Linear optimization technique to intelligently schedule LRA containers under operation constraints. To reduce training time, we apply a transfer learning technique that takes advantage of the similarity between different LRA scheduling events. We prove theoretically that our proposed algorithm is effective, stable, and safe. We implement George as a plug-in service in Docker Swarm. Experiments on our in-house cluster demonstrate that George can maximize LRA performance while enforcing hard constraints and keeping soft-constraint violations within a pre-defined threshold. The experiments show that George improves LRA performance and scale tremendously, requiring less than 1 hour of scheduling time in a large cluster with 2K containers and 700 machines, 16x faster than existing schedulers. Compared with state-of-the-art alternatives, George also achieves 26% higher container performance with up to 70% lower constraint violation.
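One way to picture the hard-constraint handling in a projection-based policy method is feasibility masking of the action distribution before sampling. This is an illustrative reading under assumed example constraints (anti-affinity and machine capacity), not a reproduction of the paper's PPPO algorithm.

```python
import numpy as np

def project_policy(probs, app, placements, capacity):
    """Zero out machines that would violate a hard constraint, then
    renormalize so the policy samples only feasible placements.
    probs: policy probabilities over machines; placements: list of app
    names per machine; capacity: max containers per machine."""
    feasible = np.array(
        [app not in placements[m]                 # anti-affinity (assumed)
         and len(placements[m]) < capacity[m]     # capacity (assumed)
         for m in range(len(probs))], dtype=float)
    masked = np.asarray(probs, dtype=float) * feasible
    if masked.sum() == 0:
        raise RuntimeError("no machine satisfies the hard constraints")
    return masked / masked.sum()

# Machine 0 already hosts a "web" replica; machine 2 is full.
probs = [0.5, 0.3, 0.2]
placements = [["web"], ["db"], ["db", "cache"]]
print(project_policy(probs, "web", placements, capacity=[2, 2, 2]))
# -> [0. 1. 0.]: only machine 1 remains a legal action
```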
Citations: 13
Towards Reliable AI for Source Code Understanding
Sahil Suneja, Yunhui Zheng, Yufan Zhuang, Jim Laredo, Alessandro Morari
Cloud maturity and popularity have resulted in open source software (OSS) proliferation. In turn, managing OSS code quality has become critical to ensuring sustainable Cloud growth. On this front, AI modeling has gained popularity in source code understanding tasks, promoted by the ready availability of large open codebases. However, we have been observing certain peculiarities with these black boxes, motivating a call for their reliability to be verified before they offset traditional code analysis. In this work, we highlight and organize different reliability issues affecting AI-for-code into three stages of an AI pipeline: data collection, model training, and prediction analysis. We highlight the need for concerted efforts from the research community to ensure credibility, accountability, and traceability for AI-for-code. For each stage, we discuss unique opportunities afforded by the source code and software engineering setting to improve AI reliability.
Citations: 6
tprof: Performance profiling via structural aggregation and automated analysis of distributed systems traces
Lexiang Huang, T. Zhu
The traditional approach for performance debugging relies upon performance profilers (e.g., gprof, VTune) that provide average function runtime information. These aggregate statistics help identify slow regions affecting the entire workload, but they are ill-suited for identifying slow regions that only impact a fraction of the workload, such as tail latency effects. This paper takes a new approach to performance profiling by utilizing distributed tracing systems (e.g., Dapper, Zipkin, Jaeger). Since traces provide detailed timing information on a per-request basis, it is possible to group and aggregate tracing data in many different ways to identify the slow parts of the system. Our new approach to trace aggregation uses the structure embedded within traces to hierarchically group similar traces and calculate increasingly detailed aggregate statistics based on how the traces are grouped. We also develop an automated tool for analyzing the hierarchy of statistics to identify the most likely performance issues. Our case study across two complex distributed systems illustrates how our tool is able to find multiple performance issues that lead to 10x and 28x performance improvements in terms of average and tail latency, respectively. Our comparison with a state-of-the-art industry tool shows that our tool can pinpoint performance slowdowns more accurately than current approaches.
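A minimal sketch of structural aggregation, with the trace schema ("spans", "parent", "name", "duration_ms") assumed for illustration: traces are grouped by their call-edge structure and per-group latency statistics are reported, so a structure that only a fraction of requests follow (e.g., a retry path) stands out even when aggregate averages look healthy.

```python
from collections import defaultdict
from statistics import mean

def signature(trace):
    """Structural key: the sorted set of (caller, callee) edges."""
    return tuple(sorted((s["parent"], s["name"]) for s in trace["spans"]))

def aggregate(traces):
    """Group traces by structure; compute per-group latency statistics."""
    groups = defaultdict(list)
    for t in traces:
        groups[signature(t)].append(t["duration_ms"])
    stats = {}
    for sig, durs in groups.items():
        durs.sort()
        stats[sig] = {"n": len(durs),
                      "avg": mean(durs),
                      "p99": durs[int(0.99 * (len(durs) - 1))]}
    return stats

traces = [
    {"duration_ms": 12, "spans": [{"parent": "root", "name": "db"}]},
    {"duration_ms": 340, "spans": [{"parent": "root", "name": "db"},
                                   {"parent": "db", "name": "retry"}]},
]
print(aggregate(traces))   # the retry-containing structure stands out
```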
Citations: 16
Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud
Pradeep Ambati, Noman Bashir, David E. Irwin, P. Shenoy
Cloud-enabled schedulers execute jobs on either fixed resources or those acquired on demand from cloud platforms. Thus, these schedulers must define not only a scheduling policy, which selects which jobs run when fixed resources become available, but also a waiting policy, which selects which jobs wait for fixed resources when they are not available rather than run on on-demand resources. As with scheduling policies, optimizing waiting policies requires a priori knowledge of job runtime. Unfortunately, prior work has shown that accurately predicting job runtime is challenging. In this paper, we show that optimizing job waiting in the cloud is possible without accurate job runtime predictions. To do so, we i) speculatively execute jobs on on-demand resources for a small time and cost to learn more about job runtime, and ii) develop an ML model to predict wait time from cluster state, which is more accurate and has less overhead than prior approaches that use job runtime predictions. We evaluate our approach on a year-long batch workload consisting of 14 million jobs, and show that it yields a cost and average wait time within 4% and 13%, respectively, of the optimal.
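A toy sketch of how the two ideas compose, with the cost comparison, rates, and field names all invented for illustration (the paper's actual waiting policies are not reproduced here): a job that does not finish within its short speculative run is routed by weighing the predicted wait time against the on-demand cost of finishing immediately.

```python
def waiting_policy(job, predict_wait_s, speculate_s=300.0,
                   ondemand_rate=1.0, fixed_rate=0.0):
    """Return 'wait' or 'on-demand' for a job that did not finish during
    its speculative run of `speculate_s` seconds. predict_wait_s stands
    in for the ML model that predicts wait time from cluster state."""
    expected_wait = predict_wait_s(job)
    # Illustrative cost comparison: short queues make waiting for fixed
    # resources cheap; long queues make the on-demand premium worthwhile.
    ondemand_cost = ondemand_rate * job.get("est_remaining_s", speculate_s)
    wait_cost = (fixed_rate * expected_wait
                 + job.get("delay_penalty", 1.0) * expected_wait)
    return "wait" if wait_cost <= ondemand_cost else "on-demand"

# Toy predictor reading queue length out of cluster state.
job = {"est_remaining_s": 1200, "delay_penalty": 0.5, "queue_len": 3}
print(waiting_policy(job, lambda j: 60.0 * j["queue_len"]))  # -> 'wait'
```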
Citations: 2