S. A. Javadi, Amoghavarsha Suresh, Muhammad Wajahat, Anshul Gandhi
Resource under-utilization is common in cloud data centers. Prior works have proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that remains unaddressed is treating the tenant workloads as black boxes. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.
{"title":"Scavenger: A Black-Box Batch Workload Resource Manager for Improving Utilization in Cloud Environments","authors":"S. A. Javadi, Amoghavarsha Suresh, Muhammad Wajahat, Anshul Gandhi","doi":"10.1145/3357223.3362734","DOIUrl":"https://doi.org/10.1145/3357223.3362734","url":null,"abstract":"Resource under-utilization is common in cloud data centers. Prior works have proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that has still not been addressed is considering the tenant workloads as a black-box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without requiring any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72819136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zachary Schall-Zimmerman, Kaveh Kamgar, N. S. Senobari, Brian Crites, G. Funning, P. Brisk, Eamonn J. Keogh
The discovery of conserved (repeated) patterns in time series is arguably the most important primitive in time series data mining. Called time series motifs, these primitive patterns are useful in their own right, and are also used as inputs into classification, clustering, segmentation, visualization, and anomaly detection algorithms. Recently the Matrix Profile has emerged as a promising representation to allow the efficient exact computation of the top-k motifs in a time series. State-of-the-art algorithms for computing the Matrix Profile are fast enough for many tasks. However, in a handful of domains, including astronomy and seismology, there is an insatiable appetite to consider ever larger datasets. In this work we show that with several novel insights we can push the motif discovery envelope using a novel scalable framework in conjunction with a deployment to commercial GPU clusters in the cloud. We demonstrate the utility of our ideas with detailed case studies in seismology, demonstrating that the efficiency of our algorithm allows us to exhaustively consider datasets that are currently only approximately searchable, allowing us to find subtle precursor earthquakes that had previously escaped attention, and other novel seismic regularities.
{"title":"Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond","authors":"Zachary Schall-Zimmerman, Kaveh Kamgar, N. S. Senobari, Brian Crites, G. Funning, P. Brisk, Eamonn J. Keogh","doi":"10.1145/3357223.3362721","DOIUrl":"https://doi.org/10.1145/3357223.3362721","url":null,"abstract":"The discovery of conserved (repeated) patterns in time series is arguably the most important primitive in time series data mining. Called time series motifs, these primitive patterns are useful in their own right, and are also used as inputs into classification, clustering, segmentation, visualization, and anomaly detection algorithms. Recently the Matrix Profile has emerged as a promising representation to allow the efficient exact computation of the top-k motifs in a time series. State-of-the-art algorithms for computing the Matrix Profile are fast enough for many tasks. However, in a handful of domains, including astronomy and seismology, there is an insatiable appetite to consider ever larger datasets. In this work we show that with several novel insights we can push the motif discovery envelope using a novel scalable framework in conjunction with a deployment to commercial GPU clusters in the cloud. We demonstrate the utility of our ideas with detailed case studies in seismology, demonstrating that the efficiency of our algorithm allows us to exhaustively consider datasets that are currently only approximately searchable, allowing us to find subtle precursor earthquakes that had previously escaped attention, and other novel seismic regularities.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76804671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cluster schedulers routinely face data-parallel jobs with complex task dependencies expressed as DAGs (directed acyclic graphs). Understanding DAG structures and runtime characteristics in large production clusters hence plays a key role in scheduler design, yet it remains an important missing piece in the literature. In this work, we present a comprehensive study of a recently released cluster trace from Alibaba. We examine the dependency structures of Alibaba jobs and find that their DAGs have sparsely connected vertices and can be approximately decomposed into multiple trees with bounded depth. We also characterize the runtime performance of DAGs and show that dependent tasks may have significant variability in resource usage and duration---even for recurring tasks. In both aspects, we compare the query jobs in the standard TPC benchmarks with the production DAGs and find the former inadequately representative. To better benchmark DAG schedulers at scale, we develop a workload generator that can faithfully synthesize task dependencies based on the production Alibaba trace. Extensive evaluations show that the synthesized DAGs have statistical characteristics consistent with the production DAGs, and that the synthesized and real workloads yield similar scheduling results under various schedulers.
{"title":"Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud","authors":"Huangshi Tian, Yunchuan Zheng, Wei Wang","doi":"10.1145/3357223.3362710","DOIUrl":"https://doi.org/10.1145/3357223.3362710","url":null,"abstract":"Cluster schedulers routinely face data-parallel jobs with complex task dependencies expressed as DAGs (directed acyclic graphs). Understanding DAG structures and runtime characteristics in large production clusters hence plays a key role in scheduler design, which, however, remains an important missing piece in the literature. In this work, we present a comprehensive study of a recently released cluster trace in Alibaba. We examine the dependency structures of Alibaba jobs and find that their DAGs have sparsely connected vertices and can be approximately decomposed into multiple trees with bounded depth. We also characterize the runtime performance of DAGs and show that dependent tasks may have significant variability in resource usage and duration---even for recurring tasks. In both aspects, we compare the query jobs in the standard TPC benchmarks with the production DAGs and find the former inadequately representative. To better benchmark DAG schedulers at scale, we develop a workload generator that can faithfully synthesize task dependencies based on the production Alibaba trace. Extensive evaluations show that the synthesized DAGs have consistent statistical characteristics as the production DAGs, and the synthesized and real workloads yield similar scheduling results with various schedulers.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"175 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73661940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuyang Liu, Shucheng Wang, Q. Cao, Ziyi Lu, Hong Jiang, Jie Yao, Yuanyuan Dong, Puyuan Yang
Cloud providers like the Alibaba cloud routinely and widely employ hybrid storage nodes composed of solid-state drives (SSDs) and hard disk drives (HDDs), reaping their respective benefits: performance from SSDs and capacity from HDDs. These hybrid storage nodes generally write incoming data to their SSDs and then flush it to their HDD counterparts, referred to as the SSD Write Back (SWB) mode, thereby ensuring low write latency. When comprehensively analyzing real production workloads from Pangu, a large-scale storage platform underlying the Alibaba cloud, we find that (1) there exist many write-dominated storage nodes (WSNs); however, (2) under the SWB mode, the SSDs of these WSNs suffer from severely high write intensity and long tail latency. To address these problems, unique to WSNs, we present SSD Write Redirect (SWR), a runtime IO scheduling mechanism for WSNs. SWR judiciously and selectively forwards some or all SSD-writes to HDDs, adapting to runtime conditions. By effectively offloading the right amount of write IOs from overburdened SSDs to underutilized HDDs in WSNs, SWR adequately alleviates the aforementioned problems, significantly improving overall system performance and SSD endurance. Our trace-driven evaluation of SWR, replaying production workload traces collected from the Alibaba cloud on our cloud testbed, shows that SWR decreases the average and 99th-percentile latencies of SSD-writes by up to 13% and 47%, respectively, notably improving system performance. Meanwhile, the amount of data written to SSDs is reduced by up to 70%, significantly improving SSD lifetime.
{"title":"Analysis of and Optimization for Write-dominated Hybrid Storage Nodes in Cloud","authors":"Shuyang Liu, Shucheng Wang, Q. Cao, Ziyi Lu, Hong Jiang, Jie Yao, Yuanyuan Dong, Puyuan Yang","doi":"10.1145/3357223.3362705","DOIUrl":"https://doi.org/10.1145/3357223.3362705","url":null,"abstract":"Cloud providers like the Alibaba cloud routinely and widely employ hybrid storage nodes composed of solid-state drives (SSDs) and hard disk drives (HDDs), reaping their respective benefits: performance from SSD and capacity from HDD. These hybrid storage nodes generally write incoming data to its SSDs and then flush them to their HDD counterparts, referred to as the SSD Write Back (SWB) mode, thereby ensuring low write latency. When comprehensively analyzing real production workloads from Pangu, a large-scale storage platform underlying the Alibaba cloud, we find that (1) there exist many write dominated storage nodes (WSNs); however, (2) under the SWB mode, the SSDs of these WSNs suffer from severely high write intensity and long tail latency. To address these unique observed problems of WSNs, we present SSD Write Redirect (SWR), a runtime IO scheduling mechanism for WSNs. SWR judiciously and selectively forwards some or all SSD-writes to HDDs, adapting to runtime conditions. By effectively offloading the right amount of write IOs from overburdened SSDs to underutilized HDDs in WSNs, SWR is able to adequately alleviate the aforementioned problems suffered by WSNs. This significantly improves overall system performance and SSD endurance. Our trace-driven evaluation of SWR, through replaying production workload traces collected from the Alibaba cloud in our cloud testbed, shows that SWR decreases the average and 99til-percentile latencies of SSD-writes by up to 13% and 47% respectively, notably improving system performance. Meanwhile the amount of data written to SSDs is reduced by up to 70%, significantly improving SSD lifetime.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80160402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as the advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows, or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs, a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process it in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them.

Hyper Scale. Cosmos has witnessed significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handling this unprecedented scale is twofold. First, we decoupled and disaggregated the query processor from the storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing.

Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (Xbox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats are supported, both proprietary and open-source ones such as Parquet. Third, users can write business logic using a mix of declarative and imperative languages, or even different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple data f…
{"title":"Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost","authors":"Hiren Patel, Alekh Jindal, C. Szyperski","doi":"10.1145/3357223.3366029","DOIUrl":"https://doi.org/10.1145/3357223.3366029","url":null,"abstract":"The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them. Hyper Scale. Cosmos has witnessed a significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handle this unprecedented scale is two fold. First, we decoupled and disaggregated the query processor from storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing. Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (XBox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for the customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats, including both propriety and open source source such as Parquet, are supported. Third, users can write business logic using a mix of declarative and imperative languages, over even different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple data f","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76185000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, many applications have started using serverless computing platforms, primarily due to the ease of deployment and cost efficiency they offer. However, the existing scheduling mechanisms of serverless platforms fall short in catering to the unique characteristics of such applications: burstiness, short and variable execution times, statelessness, and use of a single core. Specifically, the existing mechanisms fall short in meeting the requirements generated by the combined effect of these characteristics: scheduling at a scale of millions of function invocations per second while achieving predictable performance. In this paper, we argue for a cluster-level centralized and core-granular scheduler for serverless functions. By maintaining a global view of the cluster resources, the centralized approach eliminates queue imbalances, while the core granularity reduces interference; together these properties enable reduced performance variability. We expect such a scheduler to increase the adoption of serverless computing platforms by latency- and throughput-sensitive applications.
{"title":"Centralized Core-granular Scheduling for Serverless Functions","authors":"Kostis Kaffes, N. Yadwadkar, C. Kozyrakis","doi":"10.1145/3357223.3362709","DOIUrl":"https://doi.org/10.1145/3357223.3362709","url":null,"abstract":"In recent years, many applications have started using serverless computing platforms primarily due to the ease of deployment and cost efficiency they offer. However, the existing scheduling mechanisms of serverless platforms fall short in catering to the unique characteristics of such applications: burstiness, short and variable execution times, statelessness and use of a single core. Specifically, the existing mechanisms fall short in meeting the requirements generated due to the combined effect of these characteristics: scheduling at a scale of millions of function invocations per second while achieving predictable performance. In this paper, we argue for a cluster-level centralized and core-granular scheduler for serverless functions. By maintaining a global view of the cluster resources, the centralized approach eliminates queue imbalances while the core granularity reduces interference; together these properties enable reduced performance variability. We expect such a scheduler to increase the adoption of serverless computing platforms by various latency and throughput sensitive applications.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74093879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems focus on schedulers that fit very large infrastructures, which leads to workload-dependent optimisations and thereby limits their portability. However, while the number of massive public clouds is very small, there is a countless number of private clouds running very different workloads. In that context, we consider it essential to look for schedulers that overcome the workload diversity observed in private clouds, to benefit as many use cases as possible. The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we have made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed, and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.
{"title":"Hotspot Mitigations for the Masses","authors":"Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Shukla, Ramesh Chandra","doi":"10.1145/3357223.3362717","DOIUrl":"https://doi.org/10.1145/3357223.3362717","url":null,"abstract":"In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems are focused on schedulers that fit very large infrastructures, which lead to workload-dependent optimisations, thereby limiting their portability. However, while the number of massive public clouds is very small, there is a countless number of private clouds running very different workloads. In that context, we consider that it is essential to look for schedulers that overcome the workload diversity observed in private clouds to benefit as many use cases as possible. The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77626834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
According to YouTube statistics [1], more than 400 hours of content are uploaded to its platform every minute. At this rate, it is estimated that it would take more than 70 years of continuous watch time to view all content on YouTube, assuming no more content is uploaded. This raises great challenges when attempting to actively process and analyze video content. Real-time video processing is a critical element that brings forth numerous applications otherwise infeasible due to scalability constraints. Predictive models, specifically Neural Networks (NNs), are commonly used to accelerate processing time when analyzing real-time content. However, applying NNs is computationally expensive. Advanced hardware (e.g., graphics processing units or GPUs) and cloud infrastructure are usually utilized to meet the demand of processing applications. Nevertheless, recent work in the field of edge computing aims to develop systems that relieve the load on the cloud by delegating parts of the job to edge nodes. Such systems emphasize processing as much as possible within the edge node before delegating the load to the cloud, in hopes of reducing latency. In addition, processing content at the edge promotes the privacy and security of the data. One example is the work by Grulich et al. [2], where the edge node relieves some of the workload off the cloud by splitting, differentiating, and compressing the NN used to analyze the content.

Even though the collaboration between the edge node and the cloud expedites processing by relying on the edge node's capability, there is still room for improvement. Our proposal aims to utilize the edge nodes even further by allowing the nodes to collaborate among themselves as a para-cloud that minimizes the dependency on the primary processing cloud. We propose a collaborative system solution where a video uploaded to an edge node can be labeled and analyzed collaboratively without the need to utilize cloud resources. The proposed collaborative system is illustrated in Figure 1. The system consists of multiple edge nodes that acquire video content from different sources. Each node starts the analysis process via a specialized, smaller NN [3], utilizing the edge node's processing power. Whenever the load overwhelms a node, or the node is unable to provide accurate analysis via its specialized NN, the node requests other edge nodes to collaborate on the analysis instead of delegating to cloud resources. This way, high latency is avoided and the processing power of other edge nodes is utilized by splitting the NN among the different edge nodes and distributing the processing load between them.

The main contribution of this proposed approach is the alternative conceptualization of collaborative computing: instead of building a system that allows collaboration between edge nodes and the cloud, we explore the prospect of collaboration between edge nodes, minimizing the involvement of the cloud resources even further.
{"title":"Collaborative Edge-Cloud and Edge-Edge Video Analytics","authors":"Samaa Gazzaz, Faisal Nawab","doi":"10.1145/3357223.3366024","DOIUrl":"https://doi.org/10.1145/3357223.3366024","url":null,"abstract":"According to YouTube statistics [1], more than 400 hours of content is uploaded to its platform every minute. At this rate, it is estimated that it would take more than 70 years of continuous watch time in order to view all content on YouTube, assuming no more content is uploaded. This raises great challenges when attempting to actively process and analyze video content. Real-time video processing is a critical element that brings forth numerous applications otherwise infeasible due to scalability constraints. Predictive models are commonly used, specifically Neural Networks (NNs), to accelerate processing time when analyzing realtime content. However, applying NNs is computationally expensive. Advanced hardware (e.g. graphics processing units or GPUs) and cloud infrastructure are usually utilized to meet the demand of processing applications. Nevertheless, recent work in the field of edge computing aims to develop systems that relieve the load on the cloud by delegating parts of the job to edge nodes. Such systems emphasize processing as much as possible within the edge node before delegating the load to the cloud in hopes of reducing the latency. In addition, processing content in the edge promotes the privacy and security of the data. One example is the work by Grulich et al. [2] where the edge node relieves some of the work load off the cloud by splitting, differentiating and compressing the NN used to analyze the content. Even though the collaboration between the edge node and the cloud expedites the processing time by relying on the edge node's capability, there is still room for improvement. Our proposal aims to utilize the edge nodes even further by allowing the nodes to collaborate among themselves as a para-cloud that minimizes the dependency on the primary processing cloud. We propose a collaborative system solution where a video uploaded on an edge node could be labeled and analyzed collaboratively without the need to utilize cloud resources. The proposed collaborative system is illustrated in Figure 1. The system consists of multiple edge nodes that acquire video content from different sources. Each node starts the analysis process via a specialized, smaller NN [3] utilizing the edge node's processing power. Whenever the load overwhelms the node or the node is unable to provide accurate analysis via its specialized NN, the node requests other edge nodes to collaborate on the analysis instead of delegating to the cloud resources. This way the high latency is avoided and other edge node processing power is utilized by splitting the NN among the different edge nodes and distributing the processing load between them. The main contribution of this proposed approach is the alternative conceptualization of collaborative computing: instead of building a system that allows collaboration between edge nodes and the cloud, we explore the prospective of collaboration between edge nodes, minimizing the involvement of the cloud resources even furth","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80151930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
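The placement decision the proposal describes (analyze a frame locally when possible, then ask a peer edge node, and only fall back to the cloud as a last resort) can be sketched as below; the capacity numbers and the load metric are hypothetical.

```python
# Toy decision logic for edge-edge collaboration: analyze a frame locally when
# the node has capacity, otherwise hand it to the least-loaded peer edge node,
# and fall back to the cloud only as a last resort. Capacities are made up.
from dataclasses import dataclass

@dataclass
class EdgeNode:
    name: str
    capacity: int            # frames it can analyze concurrently
    in_flight: int = 0

    def has_room(self):
        return self.in_flight < self.capacity

def place_frame(local, peers):
    """Return where a newly arrived frame should be analyzed."""
    if local.has_room():
        local.in_flight += 1
        return local.name                 # cheapest: no network hop
    for peer in sorted(peers, key=lambda p: p.in_flight):
        if peer.has_room():
            peer.in_flight += 1
            return peer.name              # edge-edge collaboration
    return "cloud"                        # last resort: highest latency

a, b, c = EdgeNode("edge-a", 2), EdgeNode("edge-b", 2), EdgeNode("edge-c", 1)
for _ in range(6):
    print(place_frame(a, [b, c]))         # edge-a twice, then peers, then cloud
```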
Alekh Jindal, Hiren Patel, Abhishek Roy, S. Qiao, Zhicheng Yin, Rathijit Sen, Subru Krishnan
Database administrators (DBAs) were traditionally responsible for optimizing the on-premise database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of a DBA is completely missing. At the same time, optimizing query workloads is becoming increasingly important for reducing the total costs of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both the end users and the system developers: users have workload optimization as a major pain point while the system developers are now tasked with supporting a large base of cloud users. We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.
{"title":"Peregrine: Workload Optimization for Cloud Query Engines","authors":"Alekh Jindal, Hiren Patel, Abhishek Roy, S. Qiao, Zhicheng Yin, Rathijit Sen, Subru Krishnan","doi":"10.1145/3357223.3362726","DOIUrl":"https://doi.org/10.1145/3357223.3362726","url":null,"abstract":"Database administrators (DBAs) were traditionally responsible for optimizing the on-premise database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of a DBA is completely missing. At the same time, optimizing query workloads is becoming increasingly important for reducing the total costs of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both the end users and the system developers: users have workload optimization as a major pain point while the system developers are now tasked with supporting a large base of cloud users. We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80975994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The basic structure of the LSM [3] tree consists of four levels (we consider only four levels here): L0 in memory, and L1 to L3 on disk. Compaction in L0/L1 is done through tiering, and compaction in the rest of the tree is done through leveling. Cooperative-LSM (C-LSM) deconstructs the monolithic structure of LSM trees to enhance their scalability by utilizing the resources of multiple machines in a more flexible way. The monolithic structure of the LSM tree lacks flexibility, and the only way to deal with an increased load is to re-partition the data and distribute it across nodes.

C-LSM comprises three components: leader, compactor, and backup. The leader node receives write requests; it maintains levels L0 and L1 of the LSM tree and performs minor compactions. The compactor maintains the rest of the levels (L2 and L3) and is responsible for compacting them. The backup maintains a copy of the entire LSM tree for fault tolerance and read availability. The advantages that C-LSM provides are two-fold:
• one can place components on different machines, and
• one can have more than one instance of each component.

Running more than one instance of each component enables various performance advantages:
• Increasing the number of leaders lets the system digest data faster, because the performance of a single machine no longer limits it.
• Increasing the number of compactors makes it possible to offload compaction [1] to more nodes and thus reduce the impact of compaction on other functions.
• Increasing the number of backups increases read availability.

All of these advantages can also be achieved by re-partitioning the data and distributing the partitions across nodes, which is what most current LSM variants do. However, we hypothesize that partitioning is not feasible in all cases; consider, for example, a dynamic workload where access patterns are unpredictable and no clear partitioning is feasible. In this case, the developer either has to endure the overhead of re-partitioning the data all the time, or cannot utilize the system resources efficiently if no re-partitioning is done. C-LSM enables scaling (and down-scaling) with less overhead compared to re-partitioning; if a partition suddenly gets more requests, one can simply add a new component on another node.

Each of the components has different characteristics in terms of how it affects the workload and I/O. By having the flexibility to break down the components, one can find ways to distribute them so as to increase overall efficiency. Having multiple instances of the three components leads to interesting challenges in terms of ensuring that they work together without introducing any inconsistencies. We are trying to solve this through careful design of how these components interact and how to manage decisions when failures or scaling events happen. Another interesting problem is having multiple instances of C-LSM, each dedicated to one edge node o…
{"title":"C-LSM: Cooperative Log Structured Merge Trees","authors":"Natasha Mittal, Faisal Nawab","doi":"10.1145/3357223.3365443","DOIUrl":"https://doi.org/10.1145/3357223.3365443","url":null,"abstract":"The basic structure of the LSM[3] tree consists of four levels (we are considering only 4 levels), L0 in memory, and L1 to L3 in the disk. Compaction in L0/L1 is done through tiering, and compaction in the rest of the tree is done through leveling. Cooperative-LSM (C-LSM) is implemented by deconstructing the monolithic structure of LSM[3] trees to enhance the scalability of LSM trees by utilizing the resources of multiple machines in a more flexible way. The monolithic structure of LSM[3] tree lacks flexibility, and the only way to deal with an increased load on is to re-partition the data and distribute it across nodes. C-LSM comprises of three components - leader, compactor, and backup. Leader node receives write requests. It maintains Levels L0 and L1 of the LSM tree and performs minor compactions. Compactor maintains the rest of the levels (L2 and L3) and is responsible for compacting them. Backup maintains a copy of the entire LSM tree for fault-tolerance and read availability. The advantages that C-LSM provides are two-fold: • one can place components on different machines, and • one can have more than one instance of each component Running more than one instance for each component can enable various performance advantages: • Increasing the number of Leaders enables to digest data faster because the performance of a single machine no longer limits the system. • Increasing the number of Compactors enables to offload compaction[1] to more nodes and thus reduce the impact of compaction on other functions. • Increasing the number of backups increases read availability. Although, all these advantages can be achieved by re-partitioning the data and distributing the partitions across nodes, which most current LSM variants do. However, we hypothesize that partitioning is not feasible for all cases. For example, a dynamic workload where access patterns are unpredictable and no clear partitioning is feasible. In this case, the developer either has to endure the overhead of re-partitioning the data all the time or not be able to utilize the system resources efficiently if no re-partitioning is done. C-LSM enables scaling (and down-scaling) with less overhead compared to re-partitioning; if a partition is suddenly getting more requests, one can simply add a new component on another node. Each one of the components has different characteristics in terms of how it affects the workload and I/O. By having the flexibility to break down the components, one can find ways to distribute them in a way to increase overall efficiency. Having multiple instances of the three components leads to interesting challenges in terms of how to ensure that they work together without leading to any inconsistencies. We are trying to solve this through careful design of how these components interact and how to manage the decisions when failures or scaling events happen. Another interesting problem to solve is having multiple instances of C-LSM, each dedicated to one edge node o","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"170 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79394621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
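A highly simplified, single-process sketch of the role split described above follows: a leader owning L0/L1 and doing minor compactions, and a separate compactor owning L2/L3. The dict-based levels and trigger thresholds are illustrative; replication, backups, and cross-machine coordination are all omitted.

```python
# Highly simplified sketch of the C-LSM role split described above: the leader
# owns L0 (in memory) and L1 and performs minor compactions; a compactor owns
# L2/L3 and performs major compactions. Real C-LSM runs these roles on
# different machines with backups and consistency handling, omitted here.
L0_LIMIT = 4          # entries before a minor compaction (flush L0 -> L1)
L1_LIMIT = 8          # entries before L1 is handed to the compactor

class Compactor:
    """Owns the deeper levels (L2, L3) and merges runs shipped by the leader."""
    def __init__(self):
        self.l2, self.l3 = {}, {}

    def major_compact(self, l1_run):
        self.l2.update(l1_run)            # leveling: fold the run into L2
        if len(self.l2) > 2 * L1_LIMIT:   # occasionally push L2 down into L3
            self.l3.update(self.l2)
            self.l2 = {}

class Leader:
    """Receives writes, owns L0/L1, and performs minor compactions."""
    def __init__(self, compactor):
        self.l0, self.l1 = {}, {}
        self.compactor = compactor

    def put(self, key, value):
        self.l0[key] = value
        if len(self.l0) >= L0_LIMIT:      # minor compaction: tier L0 into L1
            self.l1.update(self.l0)
            self.l0 = {}
        if len(self.l1) >= L1_LIMIT:      # offload the heavy merge to the compactor
            self.compactor.major_compact(self.l1)
            self.l1 = {}

    def get(self, key):                   # search newest level first
        for level in (self.l0, self.l1, self.compactor.l2, self.compactor.l3):
            if key in level:
                return level[key]
        return None

leader = Leader(Compactor())
for i in range(40):
    leader.put(f"k{i}", i)
print(leader.get("k0"), len(leader.compactor.l2), len(leader.compactor.l3))  # 0 16 24
```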