
Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference): Latest Publications

Building Reliable Cloud Services Using Coyote Actors
Pantazis Deligiannis, Narayanan Ganapathy, A. Lal, S. Qadeer
Cloud services must typically be distributed across a large number of machines in order to make use of multiple compute and storage resources. This exposes the programmer to several sources of complexity, such as concurrency, message-delivery ordering, lossy networks, timeouts, and failures, all of which impose a high cognitive burden. This paper presents evidence that technology inspired by formal methods, delivered as part of a programming framework, can help address these challenges. In particular, we describe the experience of several engineering teams in Microsoft Azure that used the open-source Coyote Actor programming framework to build multiple reliable cloud services. Coyote Actors impose a principled design pattern that allows writing formal specifications alongside production code that can be systematically tested, without deviating from routine engineering practices. Engineering teams that have been using Coyote have reported dramatically increased productivity (in the time taken to push new features to production), as well as services that have been running live for months without any issues in the features developed and tested with Coyote.
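Coyote itself is a C# framework and this listing carries no code, but the actor-plus-specification pattern the abstract describes can be sketched briefly. The Python below is purely illustrative (hypothetical names, not Coyote's API): an actor drains an inbox one message at a time while a specification monitor asserts a safety property on every step, which is the kind of inline check a systematic tester can then exercise under many message interleavings.

```python
import queue
import threading

class SpecMonitor:
    """Safety specification written alongside production logic (sketch)."""
    def __init__(self):
        self.pending = set()

    def on_request(self, req_id):
        assert req_id not in self.pending, "duplicate in-flight request"
        self.pending.add(req_id)

    def on_response(self, req_id):
        assert req_id in self.pending, "response without matching request"
        self.pending.remove(req_id)

class ServerActor:
    """Toy actor: processes one inbox message at a time."""
    def __init__(self, monitor):
        self.inbox = queue.Queue()
        self.monitor = monitor

    def send(self, msg):
        self.inbox.put(msg)

    def run(self):
        while True:
            kind, req_id = self.inbox.get()
            if kind == "stop":
                break
            self.monitor.on_request(req_id)
            # ... production request handling would go here ...
            self.monitor.on_response(req_id)

monitor = SpecMonitor()
server = ServerActor(monitor)
t = threading.Thread(target=server.run)
t.start()
for i in range(3):
    server.send(("request", i))
server.send(("stop", None))
t.join()
```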
Citations: 4
Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis
Charles Reiss, Alexey Tumanov
Test of Time Award Talk for Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis, SoCC 2012.
Citations: 5
Fast and Accurate Optimizer for Query Processing over Knowledge Graphs
Jingqi Wu, Rong Chen, Yubin Xia
This paper presents Gpl, a fast and accurate optimizer for query processing over knowledge graphs. Gpl is novel in three ways. First, Gpl proposes a type-centric approach that markedly improves the accuracy of cardinality estimation by naturally embedding the correlation of multiple query conditions into the existing type system of knowledge graphs. Second, to predict execution time accurately, Gpl constructs a specialized cost model for the graph exploration scheme and tunes its coefficients for the target hardware platform and graph data. Third, Gpl uses a budget-aware strategy for plan enumeration with a greedy heuristic to boost overall performance (i.e., optimization time plus execution time) across various workloads. Evaluations with representative knowledge graphs and query benchmarks show that Gpl selects optimal plans for 33 of 39 queries and incurs less than 5% slowdown on average compared to optimal results. In contrast, the state-of-the-art optimizer and manually tuned plans cause 100% and 36% slowdown, respectively.
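To make the budget-aware greedy enumeration idea concrete, here is a minimal Python sketch under stated assumptions (the patterns, cardinality numbers, and budget are illustrative stand-ins, not Gpl's actual cost model): pick the cheapest next pattern until the optimization-time budget runs out, then fall back to input order.

```python
import time

def greedy_plan(patterns, est_cost, budget_s=0.01):
    """Greedily order patterns by estimated cost, stopping enumeration
    early once the optimization-time budget is spent."""
    start = time.monotonic()
    plan, remaining = [], list(patterns)
    while remaining and time.monotonic() - start < budget_s:
        best = min(remaining, key=lambda p: est_cost(p, plan))
        plan.append(best)
        remaining.remove(best)
    plan.extend(remaining)  # budget exhausted: keep input order for the rest
    return plan

# Toy cardinality-based cost: explore more selective (smaller) patterns first.
cards = {"?x type Person": 10_000, "?x bornIn ?c": 3_000, "?c type City": 500}
plan = greedy_plan(cards, lambda p, _: cards[p])
print(plan)  # ['?c type City', '?x bornIn ?c', '?x type Person']
```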
Citations: 2
Parslo
Amirhossein Mirhosseini, S. Elnikety, T. Wenisch
Modern cloud services are implemented as graphs of loosely-coupled microservices to improve programmability, reliability, and scalability. Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction. In such environments, each microservice is independently deployed and (auto-)scaled. However, it is unclear how to optimally scale individual microservices when end-to-end SLOs are violated or underutilized, and how to size each microservice to meet the end-to-end SLO at minimal total cost. In this paper, we propose Parslo---a Gradient Descent-based approach to assign partial SLOs among nodes in a microservice graph under an end-to-end latency SLO. At a high level, the Parslo algorithm breaks the end-to-end SLO budget into small incremental "SLO units", and iteratively allocates one marginal SLO unit to the best candidate microservice to achieve the highest total cost savings until the entire end-to-end SLO budget is exhausted. Parslo achieves a near-optimal solution, seeking to minimize the total cost for the entire service deployment, and is applicable to general microservice graphs that comprise patterns like dynamic branching, parallel fan-out, and microservice dependencies. Parslo reduces service deployment costs by more than 6x in real microservice-based applications, compared to a state-of-the-art partial SLO assignment scheme.
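The marginal-allocation loop in the abstract lends itself to a short sketch. The Python below is illustrative only (the cost function and service names are made up, and it assumes a serial chain so partial SLOs sum to the end-to-end budget): split the budget into SLO units and repeatedly hand one unit to the service whose cost drops the most.

```python
def assign_partial_slos(services, cost_fn, e2e_slo_ms, unit_ms=1.0):
    """Greedy marginal allocation of SLO units across a serial chain."""
    slo = {s: 0.0 for s in services}
    budget = e2e_slo_ms
    while budget >= unit_ms:
        # Which service saves the most cost from one more unit of slack?
        best = max(services,
                   key=lambda s: cost_fn(s, slo[s]) - cost_fn(s, slo[s] + unit_ms))
        slo[best] += unit_ms
        budget -= unit_ms
    return slo

# Toy convex cost: a looser partial SLO means fewer replicas are needed.
def cost(service, slo_ms):
    base = {"frontend": 40.0, "search": 90.0, "ads": 25.0}[service]
    return base / (1.0 + slo_ms)

print(assign_partial_slos(["frontend", "search", "ads"], cost, e2e_slo_ms=50))
```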
{"title":"Parslo","authors":"Amirhossein Mirhosseini, S. Elnikety, T. Wenisch","doi":"10.1145/3472883.3486985","DOIUrl":"https://doi.org/10.1145/3472883.3486985","url":null,"abstract":"Modern cloud services are implemented as graphs of loosely-coupled microservices to improve programmability, reliability, and scalability. Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction. In such environments, each microservice is independently deployed and (auto-)scaled. However, it is unclear how to optimally scale individual microservices when end-to-end SLOs are violated or underutilized, and how to size each microservice to meet the end-to-end SLO at minimal total cost. In this paper, we propose Parslo---a Gradient Descent-based approach to assign partial SLOs among nodes in a microservice graph under an end-to-end latency SLO. At a high level, the Parslo algorithm breaks the end-to-end SLO budget into small incremental \"SLO units\", and iteratively allocates one marginal SLO unit to the best candidate microservice to achieve the highest total cost savings until the entire end-to-end SLO budget is exhausted. Parslo achieves a near-optimal solution, seeking to minimize the total cost for the entire service deployment, and is applicable to general microservice graphs that comprise patterns like dynamic branching, parallel fan-out, and microservice dependencies. Parslo reduces service deployment costs by more than 6x in real microservice-based applications, compared to a state-of-the-art partial SLO assignment scheme.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86605837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Chronus
Wei Gao, Zhisheng Ye, P. Sun, Yonggang Wen, Tianwei Zhang
Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is the key to improving training performance, resource utilization, and fairness across users. Different training jobs may have various objectives and demands in terms of completion time, and how to efficiently satisfy all these requirements has not been extensively studied. We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs. Chronus is designed around unique features of DLT jobs. (1) It leverages the intra-job predictability of DLT processes to efficiently profile jobs and estimate their runtime speed under dynamic resource scaling. (2) It takes advantage of DLT preemption to select jobs with a lease-based training scheme. (3) It accounts for the placement sensitivity of DLT jobs when allocating resources, using new consolidation and local-search strategies. Large-scale simulations on real-world job traces show that Chronus can reduce the deadline miss rate of SLO jobs by up to 14.7x, and the completion time of best-effort jobs by up to 19.9x, compared to existing schedulers. We also implement a prototype of Chronus atop Kubernetes in a cluster of 120 GPUs to validate its practicality.
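One reading of the lease-based, deadline-aware selection can be sketched as follows. This is an illustrative Python toy (profiled runtimes, deadlines, and the admission rule are assumptions, not Chronus's actual policy): at each lease boundary, admit SLO jobs by earliest deadline while their estimated remaining time still fits, then hand leftover GPUs to best-effort jobs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    name: str
    gpus: int
    est_remaining_h: float        # from profiling
    deadline_h: Optional[float]   # None => best-effort

def select_for_lease(jobs, total_gpus, now_h=0.0):
    """Pick jobs for the next lease: feasible SLO jobs first (EDF order),
    then best-effort jobs into whatever GPUs remain."""
    free = total_gpus
    picked = []
    slo_jobs = sorted((j for j in jobs if j.deadline_h is not None),
                      key=lambda j: j.deadline_h)
    for j in slo_jobs:
        feasible = now_h + j.est_remaining_h <= j.deadline_h
        if feasible and j.gpus <= free:
            picked.append(j)
            free -= j.gpus
    for j in (j for j in jobs if j.deadline_h is None):
        if j.gpus <= free:
            picked.append(j)
            free -= j.gpus
    return picked

jobs = [Job("bert", 8, 4.0, 6.0), Job("resnet", 4, 1.0, 2.0), Job("sweep", 4, 9.0, None)]
print([j.name for j in select_for_lease(jobs, total_gpus=16)])
# ['resnet', 'bert', 'sweep']
```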
{"title":"Chronus","authors":"Wei Gao, Zhisheng Ye, P. Sun, Yonggang Wen, Tianwei Zhang","doi":"10.1145/3472883.3486978","DOIUrl":"https://doi.org/10.1145/3472883.3486978","url":null,"abstract":"Modern GPU clusters support Deep Learning training (DLT) jobs in a distributed manner. Job scheduling is the key to improve the training performance, resource utilization and fairness across users. Different training jobs may require various objectives and demands in terms of completion time. How to efficiently satisfy all these requirements is not extensively studied. We present Chronus, an end-to-end scheduling system to provide deadline guarantee for SLO jobs and maximize the performance of best-effort jobs. Chronus is designed based on the unique features of DLT jobs. (1) It leverages the intra-job predictability of DLT processes to efficiently profile jobs and estimate their runtime speed with dynamic resource scaling. (2) It takes advantages of the DLT preemption feature to select jobs with a lease-based training scheme. (3) It considers the placement sensitivity of DLT jobs to allocate resources with new consolidation and local-search strategies. Large-scale simulations on real-world job traces show that Chronus can reduce the deadline miss rate of SLO jobs by up to 14.7x, and the completion time of best-effort jobs by up to 19.9x, compared to existing schedulers. We also implement a prototype of Chronus atop Kubernents in a cluster of 120 GPUs to validate its practicability.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"88 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73388148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
Speedo
N. Daw, U. Bellur, Purushottam Kulkarni
Structuring cloud applications as collections of interacting fine-grained microservices makes them scalable and affords the flexibility of hot-upgrading parts of the application. The current incarnation of serverless computing (FaaS), with its dynamic resource allocation and auto-scaling capabilities, makes it the deployment model of choice for such applications. FaaS platforms operate with user-space dispatchers that receive requests over the network and dispatch each one to one of multiple workers (usually containers) distributed across the data center. With the granularity of microservices approaching execution times of a few milliseconds, combined with loads approaching tens of thousands of requests a second, a dispatch latency below one millisecond becomes essential to keep up with line rates. When these microservices are part of a workflow making up an application, the orchestrator that coordinates the sequence in which microservices execute also needs to operate with microsecond latency. Our observations reveal that the most significant component of the dispatch/orchestration latency is the time it takes for a request to traverse into and out of user space from the network. Motivated by the multitude of low-power cores on today's SmartNICs, one approach to keeping up with these high line rates and stringent latency expectations is to run both the dispatcher and the orchestrator close to the network, on a SmartNIC, saving the valuable cycles spent transferring requests to and from user space. The short-lived ephemeral state and low CPU burst requirements of FaaS dispatchers and orchestrators make them ideal candidates for offloading from the server to the NIC cores, with the added benefit of freeing up the server CPU. In this paper, we present Speedo, a design for offloading FaaS dispatch and orchestration services from user space to the SmartNIC. We implemented Speedo on ASIC-based Netronome Agilio SmartNICs, and our comprehensive evaluation shows that Speedo brings the dispatch latency down from ~150ms to ~140μs at a load of 10K requests per second.
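The dispatch decision itself is tiny; the abstract's point is where it runs. A toy Python illustration (not Speedo's code; worker names and the least-loaded policy are assumptions): the function below takes microseconds, while the crossing from NIC to user space and back around it is what dominates, motivating the SmartNIC offload.

```python
def dispatch(outstanding):
    """Least-outstanding-requests dispatch decision."""
    return min(outstanding, key=outstanding.get)

outstanding = {"worker-1": 3, "worker-2": 1, "worker-3": 2}
target = dispatch(outstanding)   # the decision is cheap...
outstanding[target] += 1         # ...the user-space crossing around it is not
print(target)                    # worker-2
```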
{"title":"Speedo","authors":"N. Daw, U. Bellur, Purushottam Kulkarni","doi":"10.1145/3472883.3486982","DOIUrl":"https://doi.org/10.1145/3472883.3486982","url":null,"abstract":"Structuring cloud applications as collections of interacting fine-grained microservices makes them scalable and affords the flexibility of hot upgrading parts of the application. The current avatar of serverless computing (FaaS) with its dynamic resource allocation and auto-scaling capabilities make it the deployment model of choice for such applications. FaaS platforms operate with user space dispatchers that receive requests over the network and make a dispatch decision to one of multiple workers (usually a container) distributed in the data center. With the granularity of microservices approaching execution times of a few milliseconds combined with loads approaching tens of thousands of requests a second, having a low dispatch latency of less than one millisecond becomes essential to keep up with line rates. When these microservices are part of a workflow making up an application, the orchestrator that coordinates the sequence in which microservices execute also needs to operate with microsecond latency. Our observations reveal that the most significant component of the dispatch/orchestration latency is the time it takes for the request to traverse into and out of the user space from the network. Motivated by the presence of a multitude of low power cores on today's SmartNICs, one approach to keeping up with these high line rates and the stringent latency expectations is to run both the dispatcher and the orchestrator close to the network on a SmartNIC. Doing so will save valuable cycles spent in transferring requests to and back from the user space. The operating characteristics of short-lived ephemeral state and low CPU burst requirements of FaaS dispatcher/orchestrator make them ideal candidates for offloading from the server to the NIC cores. This also brings other benefit of freeing up the server CPU. In this paper, we present Speedo--- a design for offloading of FaaS dispatch and orchestration services to the SmartNIC from the user space. We implemented Speedo on ASIC based Netronome Agilio SmartNICs and our comprehensive evaluation shows that Speedo brings down the dispatch latency from ~150ms to ~140μs at a load of 10K requests per second.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"235 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79703061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Automating instrumentation choices for performance problems in distributed applications with VAIF
Mert Toslali, E. Ates, Alex Ellis, Zhaoqing Zhang, Darby Huye, Lan Liu, Samantha Puterman, A. Coskun, Raja R. Sambasivan
Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests' traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
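The variance-decomposition insight can be sketched in a few lines. This Python toy is illustrative only (the span data and top-k rule are assumptions, not VAIF's implementation): group critical-path span latencies by code location and enable extra instrumentation only where variance is concentrated.

```python
from collections import defaultdict
from statistics import pvariance

def choose_instrumentation(trace_spans, top_k=2):
    """trace_spans: iterable of (code_location, critical_path_latency_ms).
    Returns the locations whose logs to enable next."""
    by_loc = defaultdict(list)
    for loc, ms in trace_spans:
        by_loc[loc].append(ms)
    ranked = sorted(by_loc, key=lambda l: pvariance(by_loc[l]), reverse=True)
    return ranked[:top_k]

spans = [("db.query", 5), ("db.query", 90), ("cache.get", 1),
         ("cache.get", 2), ("rpc.call", 30), ("rpc.call", 95)]
print(choose_instrumentation(spans))  # ['db.query', 'rpc.call']
```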
Citations: 6
The Future of Cloud Data: Challenges and Research Opportunities
Peter D. Bailis
The last several years have seen the creation of hundreds of billions of dollars in market value, including the largest software IPO of all time, centered around one technology category: cloud data. While cloud data is not new, the rate of adoption across almost every industry and the associated pace of development around all aspects of cloud data (from pipelines to extract-load-transform (ELT) tools to storage and analytics) are unprecedented. In this talk, I'll present a research-oriented perspective on the future of cloud data that combines my experiences as an academic at Stanford and as a startup founder and CEO at Sisu Data. My goal is to provide an overview of the seismic changes in the cloud data landscape that, in my opinion, have yet to receive sufficient attention from research, and to highlight several tantalizing research opportunities in systems and databases that result.
Citations: 0
Scrooge
Yitao Hu, Rajrup Ghosh, R. Govindan
Advances in deep learning (DL) have prompted the development of cloud-hosted DL-based media applications that process video and audio streams in real time. Such applications must satisfy throughput and latency objectives and adapt to novel types of dynamics, while incurring minimal cost. Scrooge, a system that provides media applications as a service, achieves these objectives by packing computations efficiently into GPU-equipped cloud VMs, using an optimization formulation to find the lowest-cost VM allocations that meet the performance objectives, and rapidly reacting to variations in input complexity (e.g., changes in participants in a video). Experiments show that Scrooge can save serving cost by 16-32% (which translates to tens of thousands of dollars per year) relative to the state of the art, while achieving latency objectives over 98% of the time under dynamic workloads.
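The lowest-cost-allocation idea reduces, in its simplest form, to a small search. The Python sketch below is illustrative (VM names, capacities, and prices are made-up stand-ins, not Scrooge's optimizer): pick the cheapest VM count whose aggregate throughput meets the stream's demand.

```python
def cheapest_allocation(vm_types, demand_fps):
    """vm_types: {name: (capacity_fps, dollars_per_hour)}.
    Returns (vm_name, count, hourly_cost) minimizing cost at the demand."""
    best = None
    for name, (cap, price) in vm_types.items():
        count = -(-demand_fps // cap)  # ceiling division
        cost = count * price
        if best is None or cost < best[2]:
            best = (name, count, cost)
    return best

vms = {"vm-small": (60, 0.526), "vm-large": (110, 0.752)}
print(cheapest_allocation(vms, demand_fps=300))  # ('vm-large', 3, 2.256)
```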
{"title":"Scrooge","authors":"Yitao Hu, Rajrup Ghosh, R. Govindan","doi":"10.1145/3472883.3486993","DOIUrl":"https://doi.org/10.1145/3472883.3486993","url":null,"abstract":"Advances in deep learning (DL) have prompted the development of cloud-hosted DL-based media applications that process video and audio streams in real-time. Such applications must satisfy throughput and latency objectives and adapt to novel types of dynamics, while incurring minimal cost. Scrooge, a system that provides media applications as a service, achieves these objectives by packing computations efficiently into GPU-equipped cloud VMs, using an optimization formulation to find the lowest cost VM allocations that meet the performance objectives, and rapidly reacting to variations in input complexity (e.g., changes in participants in a video). Experiments show that Scrooge can save serving cost by 16-32% (which translate to tens of thousands of dollars per year) relative to the state-of-the-art while achieving latency objectives for over 98% under dynamic workloads.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78896456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Leveraging Data to Improve Cloud Services
Ranjita Bhagwan
Today's cloud services are large, complex, and dynamic, often supporting billions of users. Such a complex and dynamic environment poses several challenges, such as ensuring fast and secure development and deployment, and promptly resolving service disruptions. Nevertheless, new opportunities to address these challenges have emerged. Large-scale services generate petabytes of code, test, and usage-related data within just one day. This data can be harnessed to provide engineers with valuable insights into how to improve service performance, security, and reliability. However, cherry-picking important information from such vast amounts of systems-related data proves to be a formidable task. Over the last few years, we have developed many analysis tools that leverage code, test logs, and telemetry to address these challenges. In this talk, I will describe our experience with building such tools and recount our journey, which started with determining the right problems to solve, continued through research contributions, and ended with widespread deployment across Microsoft's services.
Citations: 0