
Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference) - Latest Publications

Scavenger: A Black-Box Batch Workload Resource Manager for Improving Utilization in Cloud Environments
S. A. Javadi, Amoghavarsha Suresh, Muhammad Wajahat, Anshul Gandhi
Resource under-utilization is common in cloud data centers. Prior works have proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that has still not been addressed is considering the tenant workloads as a black-box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without requiring any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.
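To make the regulation idea above concrete, the sketch below shows a toy control loop that periodically estimates the tenant VM's CPU demand and caps the colocated batch containers to whatever is left over. It only illustrates the black-box approach described in the abstract: the probe and throttle functions, the headroom constant, and the CPU-only focus are assumptions, not Scavenger's actual implementation (which also regulates memory capacity and network bandwidth).

```python
# Hypothetical sketch (not Scavenger's code): a control loop that treats the
# tenant VM as a black box, observes host-level usage, and throttles colocated
# batch containers so spare capacity is scavenged only when the tenant leaves headroom.
import time

HOST_CORES = 16
SAFETY_HEADROOM = 2.0          # cores always reserved for tenant bursts (assumed policy)

def tenant_cpu_usage() -> float:
    """Placeholder: would read host counters (e.g. /proc/stat minus the batch cgroup)."""
    return 6.0

def set_batch_cpu_quota(cores: float) -> None:
    """Placeholder: would write the batch cgroup's CPU quota."""
    print(f"batch jobs capped at {cores:.1f} cores")

def control_loop(interval_s: float = 1.0, iterations: int = 3) -> None:
    for _ in range(iterations):
        spare = HOST_CORES - tenant_cpu_usage() - SAFETY_HEADROOM
        set_batch_cpu_quota(max(spare, 0.0))   # never let batch work squeeze the tenant
        time.sleep(interval_s)

if __name__ == "__main__":
    control_loop()
```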
{"title":"Scavenger: A Black-Box Batch Workload Resource Manager for Improving Utilization in Cloud Environments","authors":"S. A. Javadi, Amoghavarsha Suresh, Muhammad Wajahat, Anshul Gandhi","doi":"10.1145/3357223.3362734","DOIUrl":"https://doi.org/10.1145/3357223.3362734","url":null,"abstract":"Resource under-utilization is common in cloud data centers. Prior works have proposed improving utilization by running provider workloads in the background, colocated with tenant workloads. However, an important challenge that has still not been addressed is considering the tenant workloads as a black-box. We present Scavenger, a batch workload manager that opportunistically runs containerized batch jobs next to black-box tenant VMs to improve utilization. Scavenger is designed to work without requiring any offline profiling or prior information about the tenant workload. To meet the tenant VMs' resource demand at all times, Scavenger dynamically regulates the resource usage of batch jobs, including processor usage, memory capacity, and network bandwidth. We experimentally evaluate Scavenger on two different testbeds using latency-sensitive tenant workloads colocated with Spark jobs in the background and show that Scavenger significantly increases resource usage without compromising the resource demands of tenant VMs.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72819136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 40
Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond
Zachary Schall-Zimmerman, Kaveh Kamgar, N. S. Senobari, Brian Crites, G. Funning, P. Brisk, Eamonn J. Keogh
The discovery of conserved (repeated) patterns in time series is arguably the most important primitive in time series data mining. Called time series motifs, these primitive patterns are useful in their own right, and are also used as inputs into classification, clustering, segmentation, visualization, and anomaly detection algorithms. Recently the Matrix Profile has emerged as a promising representation to allow the efficient exact computation of the top-k motifs in a time series. State-of-the-art algorithms for computing the Matrix Profile are fast enough for many tasks. However, in a handful of domains, including astronomy and seismology, there is an insatiable appetite to consider ever larger datasets. In this work we show that with several novel insights we can push the motif discovery envelope using a novel scalable framework in conjunction with a deployment to commercial GPU clusters in the cloud. We demonstrate the utility of our ideas with detailed case studies in seismology, demonstrating that the efficiency of our algorithm allows us to exhaustively consider datasets that are currently only approximately searchable, allowing us to find subtle precursor earthquakes that had previously escaped attention, and other novel seismic regularities.
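As context for the abstract above, the snippet below computes a matrix profile by brute force: for every z-normalized subsequence it records the distance to its nearest non-trivial neighbor, and the minima of that profile mark motif candidates. This O(n²m) sketch is only meant to show what is being computed; the paper's contribution is computing it at vastly larger scale on GPUs.

```python
# Minimal brute-force matrix profile, for illustration only; the GPU framework
# in the paper uses far more efficient formulations.
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / (s if s > 0 else 1.0)

def matrix_profile(ts, m):
    n = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n)])
    mp = np.full(n, np.inf)
    excl = m // 2                      # exclusion zone to skip trivial self-matches
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):i + excl + 1] = np.inf
        mp[i] = d.min()
    return mp                          # low values mark motif candidates

if __name__ == "__main__":
    t = np.sin(np.linspace(0, 20 * np.pi, 800)) + 0.05 * np.random.randn(800)
    print(matrix_profile(t, m=50).argmin())   # index of the best motif occurrence
```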
{"title":"Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond","authors":"Zachary Schall-Zimmerman, Kaveh Kamgar, N. S. Senobari, Brian Crites, G. Funning, P. Brisk, Eamonn J. Keogh","doi":"10.1145/3357223.3362721","DOIUrl":"https://doi.org/10.1145/3357223.3362721","url":null,"abstract":"The discovery of conserved (repeated) patterns in time series is arguably the most important primitive in time series data mining. Called time series motifs, these primitive patterns are useful in their own right, and are also used as inputs into classification, clustering, segmentation, visualization, and anomaly detection algorithms. Recently the Matrix Profile has emerged as a promising representation to allow the efficient exact computation of the top-k motifs in a time series. State-of-the-art algorithms for computing the Matrix Profile are fast enough for many tasks. However, in a handful of domains, including astronomy and seismology, there is an insatiable appetite to consider ever larger datasets. In this work we show that with several novel insights we can push the motif discovery envelope using a novel scalable framework in conjunction with a deployment to commercial GPU clusters in the cloud. We demonstrate the utility of our ideas with detailed case studies in seismology, demonstrating that the efficiency of our algorithm allows us to exhaustively consider datasets that are currently only approximately searchable, allowing us to find subtle precursor earthquakes that had previously escaped attention, and other novel seismic regularities.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76804671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 35
Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud
Huangshi Tian, Yunchuan Zheng, Wei Wang
Cluster schedulers routinely face data-parallel jobs with complex task dependencies expressed as DAGs (directed acyclic graphs). Understanding DAG structures and runtime characteristics in large production clusters hence plays a key role in scheduler design, which, however, remains an important missing piece in the literature. In this work, we present a comprehensive study of a recently released cluster trace in Alibaba. We examine the dependency structures of Alibaba jobs and find that their DAGs have sparsely connected vertices and can be approximately decomposed into multiple trees with bounded depth. We also characterize the runtime performance of DAGs and show that dependent tasks may have significant variability in resource usage and duration---even for recurring tasks. In both aspects, we compare the query jobs in the standard TPC benchmarks with the production DAGs and find the former inadequately representative. To better benchmark DAG schedulers at scale, we develop a workload generator that can faithfully synthesize task dependencies based on the production Alibaba trace. Extensive evaluations show that the synthesized DAGs have consistent statistical characteristics as the production DAGs, and the synthesized and real workloads yield similar scheduling results with various schedulers.
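A minimal illustration of the synthesis idea: the toy generator below grows tree-shaped DAGs with a bounded depth, echoing the paper's observation that production DAGs approximately decompose into bounded-depth trees. The depth bound, attachment rule, and parameters are invented for illustration; the actual generator is fit to the Alibaba trace and also models task counts, durations, and resource usage.

```python
# Toy DAG synthesizer, loosely inspired by the bounded-depth-tree observation;
# not the paper's workload generator.
import random

def synthesize_dag(num_tasks: int, max_depth: int = 4, seed: int = 0):
    random.seed(seed)
    depth = {0: 0}                     # task 0 is the root
    edges = []
    for t in range(1, num_tasks):
        # attach each new task under a parent that keeps the overall depth bounded
        parent = random.choice([p for p, d in depth.items() if d < max_depth - 1])
        depth[t] = depth[parent] + 1
        edges.append((parent, t))      # parent must finish before t starts
    return edges

if __name__ == "__main__":
    for u, v in synthesize_dag(10):
        print(f"task {u} -> task {v}")
```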
{"title":"Characterizing and Synthesizing Task Dependencies of Data-Parallel Jobs in Alibaba Cloud","authors":"Huangshi Tian, Yunchuan Zheng, Wei Wang","doi":"10.1145/3357223.3362710","DOIUrl":"https://doi.org/10.1145/3357223.3362710","url":null,"abstract":"Cluster schedulers routinely face data-parallel jobs with complex task dependencies expressed as DAGs (directed acyclic graphs). Understanding DAG structures and runtime characteristics in large production clusters hence plays a key role in scheduler design, which, however, remains an important missing piece in the literature. In this work, we present a comprehensive study of a recently released cluster trace in Alibaba. We examine the dependency structures of Alibaba jobs and find that their DAGs have sparsely connected vertices and can be approximately decomposed into multiple trees with bounded depth. We also characterize the runtime performance of DAGs and show that dependent tasks may have significant variability in resource usage and duration---even for recurring tasks. In both aspects, we compare the query jobs in the standard TPC benchmarks with the production DAGs and find the former inadequately representative. To better benchmark DAG schedulers at scale, we develop a workload generator that can faithfully synthesize task dependencies based on the production Alibaba trace. Extensive evaluations show that the synthesized DAGs have consistent statistical characteristics as the production DAGs, and the synthesized and real workloads yield similar scheduling results with various schedulers.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"175 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73661940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 43
Analysis of and Optimization for Write-dominated Hybrid Storage Nodes in Cloud
Shuyang Liu, Shucheng Wang, Q. Cao, Ziyi Lu, Hong Jiang, Jie Yao, Yuanyuan Dong, Puyuan Yang
Cloud providers like the Alibaba cloud routinely and widely employ hybrid storage nodes composed of solid-state drives (SSDs) and hard disk drives (HDDs), reaping their respective benefits: performance from SSD and capacity from HDD. These hybrid storage nodes generally write incoming data to their SSDs and then flush them to their HDD counterparts, referred to as the SSD Write Back (SWB) mode, thereby ensuring low write latency. When comprehensively analyzing real production workloads from Pangu, a large-scale storage platform underlying the Alibaba cloud, we find that (1) there exist many write-dominated storage nodes (WSNs); however, (2) under the SWB mode, the SSDs of these WSNs suffer from severely high write intensity and long tail latency. To address these unique observed problems of WSNs, we present SSD Write Redirect (SWR), a runtime IO scheduling mechanism for WSNs. SWR judiciously and selectively forwards some or all SSD-writes to HDDs, adapting to runtime conditions. By effectively offloading the right amount of write IOs from overburdened SSDs to underutilized HDDs in WSNs, SWR is able to adequately alleviate the aforementioned problems suffered by WSNs. This significantly improves overall system performance and SSD endurance. Our trace-driven evaluation of SWR, through replaying production workload traces collected from the Alibaba cloud in our cloud testbed, shows that SWR decreases the average and 99th-percentile latencies of SSD-writes by up to 13% and 47% respectively, notably improving system performance. Meanwhile, the amount of data written to SSDs is reduced by up to 70%, significantly improving SSD lifetime.
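The sketch below captures the gist of write redirection: keep the low-latency SSD write-back path by default, and divert writes to the HDD only when the SSD is overburdened and the HDD has spare capacity. The signals and thresholds are placeholders; the paper's mechanism adapts its decisions to runtime conditions rather than using fixed limits.

```python
# Hedged sketch of the SWR idea: per-write redirection from SSD to HDD under
# pressure. Thresholds and stats are invented for illustration.
from dataclasses import dataclass

@dataclass
class DeviceStats:
    queue_depth: int        # outstanding IOs
    util: float             # 0.0 - 1.0 busy fraction

def route_write(ssd: DeviceStats, hdd: DeviceStats,
                ssd_q_limit: int = 32, hdd_util_limit: float = 0.7) -> str:
    # Redirect only when the SSD is overburdened AND the HDD has spare capacity;
    # otherwise keep the low-latency SSD write-back path.
    if ssd.queue_depth > ssd_q_limit and hdd.util < hdd_util_limit:
        return "HDD"
    return "SSD"

if __name__ == "__main__":
    print(route_write(DeviceStats(48, 0.95), DeviceStats(4, 0.30)))   # -> HDD
    print(route_write(DeviceStats(8, 0.40), DeviceStats(4, 0.30)))    # -> SSD
```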
{"title":"Analysis of and Optimization for Write-dominated Hybrid Storage Nodes in Cloud","authors":"Shuyang Liu, Shucheng Wang, Q. Cao, Ziyi Lu, Hong Jiang, Jie Yao, Yuanyuan Dong, Puyuan Yang","doi":"10.1145/3357223.3362705","DOIUrl":"https://doi.org/10.1145/3357223.3362705","url":null,"abstract":"Cloud providers like the Alibaba cloud routinely and widely employ hybrid storage nodes composed of solid-state drives (SSDs) and hard disk drives (HDDs), reaping their respective benefits: performance from SSD and capacity from HDD. These hybrid storage nodes generally write incoming data to its SSDs and then flush them to their HDD counterparts, referred to as the SSD Write Back (SWB) mode, thereby ensuring low write latency. When comprehensively analyzing real production workloads from Pangu, a large-scale storage platform underlying the Alibaba cloud, we find that (1) there exist many write dominated storage nodes (WSNs); however, (2) under the SWB mode, the SSDs of these WSNs suffer from severely high write intensity and long tail latency. To address these unique observed problems of WSNs, we present SSD Write Redirect (SWR), a runtime IO scheduling mechanism for WSNs. SWR judiciously and selectively forwards some or all SSD-writes to HDDs, adapting to runtime conditions. By effectively offloading the right amount of write IOs from overburdened SSDs to underutilized HDDs in WSNs, SWR is able to adequately alleviate the aforementioned problems suffered by WSNs. This significantly improves overall system performance and SSD endurance. Our trace-driven evaluation of SWR, through replaying production workload traces collected from the Alibaba cloud in our cloud testbed, shows that SWR decreases the average and 99til-percentile latencies of SSD-writes by up to 13% and 47% respectively, notably improving system performance. Meanwhile the amount of data written to SSDs is reduced by up to 70%, significantly improving SSD lifetime.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80160402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost
Hiren Patel, Alekh Jindal, C. Szyperski
The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as the advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows, or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs, a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process it in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them.

Hyper Scale. Cosmos has witnessed a significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handling this unprecedented scale is twofold. First, we decoupled and disaggregated the query processor from the storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing.

Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (XBox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats, both proprietary and open source such as Parquet, are supported. Third, users can write business logic using a mix of declarative and imperative languages, even across different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple dataflow-style computations, improving readability and maintainability. Finally, given the diverse mix of workloads within Microsoft, we have realized that SCOPE cannot fit every scenario; therefore, we also support the popular Spark query processing engine. Overall, the one-size-fits-all query processing experience in Cosmos covers very different workloads, data formats, programming languages, and backend engines.

Minimal Cost. While scale and complexity are hard problems in their own right, the biggest challenge is to achieve all of the above at minimal cost. Indeed, improving the efficiency of Cosmos and reducing operational costs are pressing needs. This is challenging for several reasons. First, SCOPE jobs are very hard to optimize given that SCOPE DAGs are very large (up to a thousand operators in a single job!), and optimizer estimates (cardinalities, costs, etc.) are often far from the actuals. Second, SCOPE optimizes a given query, whereas operational costs depend on the overall workload, so workload optimization becomes very important. Finally, SCOPE jobs are typically chained to each other in data pipelines, i.e., the output of one job is consumed by other jobs, which means that workload optimization needs to be aware of these dependencies. Our approach is to build a feedback loop that learns from past workloads to optimize future ones. Specifically, we leverage machine learning to learn models for optimizing individual jobs [3], apply multi-query optimization to reduce the cost of the overall workload [1], and build dependency graphs to identify and optimize data pipelines.
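As a small illustration of the dependency awareness mentioned at the end of the abstract, the snippet below links jobs into a dependency graph by matching each job's inputs against other jobs' outputs, which is the kind of structure a workload optimizer needs before it can reason about pipelines. The job names and paths are made up; this is not Cosmos/SCOPE code.

```python
# Illustrative only: build a job dependency graph by linking a job to any job
# that consumes one of its outputs.
def build_dependency_graph(jobs):
    producers = {}                                    # output path -> producing job
    for name, meta in jobs.items():
        for out in meta["outputs"]:
            producers[out] = name
    edges = []
    for name, meta in jobs.items():
        for inp in meta["inputs"]:
            if inp in producers:
                edges.append((producers[inp], name))  # upstream -> downstream
    return edges

if __name__ == "__main__":
    jobs = {
        "ingest": {"inputs": ["/raw/clicks"], "outputs": ["/staged/clicks"]},
        "ranker": {"inputs": ["/staged/clicks"], "outputs": ["/models/rank"]},
        "report": {"inputs": ["/models/rank"], "outputs": ["/reports/daily"]},
    }
    print(build_dependency_graph(jobs))   # [('ingest', 'ranker'), ('ranker', 'report')]
```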
{"title":"Big Data Processing at Microsoft: Hyper Scale, Massive Complexity, and Minimal Cost","authors":"Hiren Patel, Alekh Jindal, C. Szyperski","doi":"10.1145/3357223.3366029","DOIUrl":"https://doi.org/10.1145/3357223.3366029","url":null,"abstract":"The past decade has seen a tremendous interest in large-scale data processing at Microsoft. Typical scenarios include building business-critical pipelines such as advertiser feedback loop, index builder, and relevance/ranking algorithms for Bing; analyzing user experience telemetry for Office, Windows or Xbox; and gathering recommendations for products like Windows and Xbox. To address these needs a first-party big data analytics platform, referred to as Cosmos, was developed in the early 2010s at Microsoft. Cosmos makes it possible to store data at exabyte scale and process in a serverless form factor, with SCOPE [4] being the query processing workhorse. Over time, however, several newer challenges have emerged, requiring major technical innovations in Cosmos to meet these newer demands. In this abstract, we describe three such challenges from the query processing viewpoint, and our approaches to handling them. Hyper Scale. Cosmos has witnessed a significant growth in usage from its early days, from the number of customers (starting from Bing to almost every single business unit at Microsoft today), to the volume of data processed (from petabytes to exabytes today), to the amount of processing done (from tens of thousands of SCOPE jobs to hundreds of thousands of jobs today, across hundreds of thousands of machines). Even a single job can consume tens of petabytes of data and produce similar volumes of data by running millions of tasks in parallel. Our approach to handle this unprecedented scale is two fold. First, we decoupled and disaggregated the query processor from storage and resource management components, thereby allowing different components in the Cosmos stack to scale independently. Second, we scaled the data movement in the SCOPE query processor with quasilinear complexity [2]. This is crucial since data movement is often the most expensive step, and hence the bottleneck, in massive-scale data processing. Massive Complexity. Cosmos workloads are also highly complex. Thanks to adoption across the whole of Microsoft, Cosmos needs to support workloads that are representative of multiple industry segments, including search engine (Bing), operating system (Windows), workplace productivity (Office), personal computing (Surface), gaming (XBox), etc. To handle such diverse workloads, our approach has been to provide a one-size-fits-all experience. First of all, to make it easy for the customers to express their computations, SCOPE supports different types of queries, from batch to interactive to streaming and machine learning. Second, SCOPE supports both structured and unstructured data processing. Likewise, multiple data formats, including both propriety and open source source such as Parquet, are supported. Third, users can write business logic using a mix of declarative and imperative languages, over even different imperative languages such as C# and Python, in the same job. Furthermore, users can express all of the above in simple data f","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76185000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Centralized Core-granular Scheduling for Serverless Functions
Kostis Kaffes, N. Yadwadkar, C. Kozyrakis
In recent years, many applications have started using serverless computing platforms primarily due to the ease of deployment and cost efficiency they offer. However, the existing scheduling mechanisms of serverless platforms fall short in catering to the unique characteristics of such applications: burstiness, short and variable execution times, statelessness and use of a single core. Specifically, the existing mechanisms fall short in meeting the requirements generated due to the combined effect of these characteristics: scheduling at a scale of millions of function invocations per second while achieving predictable performance. In this paper, we argue for a cluster-level centralized and core-granular scheduler for serverless functions. By maintaining a global view of the cluster resources, the centralized approach eliminates queue imbalances while the core granularity reduces interference; together these properties enable reduced performance variability. We expect such a scheduler to increase the adoption of serverless computing platforms by various latency and throughput sensitive applications.
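A minimal sketch of the argued design, under assumed names: a single cluster-level scheduler keeps one global queue of invocations and one global set of free cores, so per-node queue imbalance cannot arise and each function gets a core to itself. Real concerns such as dispatch latency, locality, and fault tolerance are omitted.

```python
# Toy centralized, core-granular scheduler; names and structure are illustrative.
from collections import deque

class CoreGranularScheduler:
    def __init__(self, cores_per_node: int, nodes: int):
        self.free_cores = deque((n, c) for n in range(nodes)
                                        for c in range(cores_per_node))
        self.pending = deque()          # invocations waiting for a core

    def submit(self, invocation: str):
        self.pending.append(invocation)
        self._dispatch()

    def complete(self, core):
        self.free_cores.append(core)    # core freed when the function returns
        self._dispatch()

    def _dispatch(self):
        while self.pending and self.free_cores:
            inv, core = self.pending.popleft(), self.free_cores.popleft()
            print(f"{inv} -> node {core[0]} core {core[1]}")

if __name__ == "__main__":
    s = CoreGranularScheduler(cores_per_node=2, nodes=2)
    for i in range(5):
        s.submit(f"fn-{i}")
    s.complete((0, 0))                  # a finishing function unblocks fn-4
```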
{"title":"Centralized Core-granular Scheduling for Serverless Functions","authors":"Kostis Kaffes, N. Yadwadkar, C. Kozyrakis","doi":"10.1145/3357223.3362709","DOIUrl":"https://doi.org/10.1145/3357223.3362709","url":null,"abstract":"In recent years, many applications have started using serverless computing platforms primarily due to the ease of deployment and cost efficiency they offer. However, the existing scheduling mechanisms of serverless platforms fall short in catering to the unique characteristics of such applications: burstiness, short and variable execution times, statelessness and use of a single core. Specifically, the existing mechanisms fall short in meeting the requirements generated due to the combined effect of these characteristics: scheduling at a scale of millions of function invocations per second while achieving predictable performance. In this paper, we argue for a cluster-level centralized and core-granular scheduler for serverless functions. By maintaining a global view of the cluster resources, the centralized approach eliminates queue imbalances while the core granularity reduces interference; together these properties enable reduced performance variability. We expect such a scheduler to increase the adoption of serverless computing platforms by various latency and throughput sensitive applications.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74093879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 74
Hotspot Mitigations for the Masses
Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Shukla, Ramesh Chandra
In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems are focused on schedulers that fit very large infrastructures, which lead to workload-dependent optimisations, thereby limiting their portability. However, while the number of massive public clouds is very small, there is a countless number of private clouds running very different workloads. In that context, we consider that it is essential to look for schedulers that overcome the workload diversity observed in private clouds to benefit as many use cases as possible. The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.
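For intuition, the toy function below performs the simplest possible hotspot mitigation: it flags hosts whose aggregate VM CPU demand exceeds a threshold and proposes migrating the heaviest VM to the least-loaded host. The threshold and selection rules are invented assumptions and do not reflect the ADS algorithms discussed in the paper.

```python
# Hypothetical hotspot-mitigation illustration (not the ADS algorithm).
def mitigate_hotspots(hosts, threshold=0.85):
    """hosts: {host: {vm: cpu_fraction}}; returns a list of (vm, src, dst) moves."""
    moves = []
    load = {h: sum(vms.values()) for h, vms in hosts.items()}
    for host, vms in hosts.items():
        if load[host] > threshold and vms:
            vm = max(vms, key=vms.get)                 # heaviest VM on the hot host
            dst = min(load, key=load.get)              # coolest host right now
            if dst != host:
                moves.append((vm, host, dst))
                load[host] -= vms[vm]
                load[dst] += vms[vm]
    return moves

if __name__ == "__main__":
    print(mitigate_hotspots({"h1": {"vmA": 0.6, "vmB": 0.4}, "h2": {"vmC": 0.2}}))
```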
{"title":"Hotspot Mitigations for the Masses","authors":"Fabien Hermenier, Aditya Ramesh, Abhinay Nagpal, Himanshu Shukla, Ramesh Chandra","doi":"10.1145/3357223.3362717","DOIUrl":"https://doi.org/10.1145/3357223.3362717","url":null,"abstract":"In an IaaS cloud, the dynamic VM scheduler observes and mitigates resource hotspots to maintain performance in an oversubscribed environment. Most systems are focused on schedulers that fit very large infrastructures, which lead to workload-dependent optimisations, thereby limiting their portability. However, while the number of massive public clouds is very small, there is a countless number of private clouds running very different workloads. In that context, we consider that it is essential to look for schedulers that overcome the workload diversity observed in private clouds to benefit as many use cases as possible. The Acropolis Dynamic Scheduler (ADS) mitigates hotspots in private clouds managed by the Acropolis Operating System. In this paper, we review the design and implementation of ADS and the major changes we made since 2017 to improve its portability. We rely on thousands of customer cluster traces to illustrate the motivation behind the changes, revisit existing approaches, propose alternatives when needed and qualify their respective benefits. Finally, we discuss the lessons learned from an engineering point of view.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77626834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Collaborative Edge-Cloud and Edge-Edge Video Analytics
Samaa Gazzaz, Faisal Nawab
According to YouTube statistics [1], more than 400 hours of content is uploaded to its platform every minute. At this rate, it is estimated that it would take more than 70 years of continuous watch time in order to view all content on YouTube, assuming no more content is uploaded. This raises great challenges when attempting to actively process and analyze video content. Real-time video processing is a critical element that brings forth numerous applications otherwise infeasible due to scalability constraints. Predictive models, specifically Neural Networks (NNs), are commonly used to accelerate processing time when analyzing real-time content. However, applying NNs is computationally expensive. Advanced hardware (e.g. graphics processing units or GPUs) and cloud infrastructure are usually utilized to meet the demand of processing applications. Nevertheless, recent work in the field of edge computing aims to develop systems that relieve the load on the cloud by delegating parts of the job to edge nodes. Such systems emphasize processing as much as possible within the edge node before delegating the load to the cloud in hopes of reducing the latency. In addition, processing content in the edge promotes the privacy and security of the data. One example is the work by Grulich et al. [2], where the edge node relieves some of the workload off the cloud by splitting, differentiating and compressing the NN used to analyze the content. Even though the collaboration between the edge node and the cloud expedites the processing time by relying on the edge node's capability, there is still room for improvement. Our proposal aims to utilize the edge nodes even further by allowing the nodes to collaborate among themselves as a para-cloud that minimizes the dependency on the primary processing cloud. We propose a collaborative system solution where a video uploaded on an edge node could be labeled and analyzed collaboratively without the need to utilize cloud resources. The proposed collaborative system is illustrated in Figure 1. The system consists of multiple edge nodes that acquire video content from different sources. Each node starts the analysis process via a specialized, smaller NN [3] utilizing the edge node's processing power. Whenever the load overwhelms the node or the node is unable to provide accurate analysis via its specialized NN, the node requests other edge nodes to collaborate on the analysis instead of delegating to the cloud resources. This way the high latency is avoided and other edge nodes' processing power is utilized by splitting the NN among the different edge nodes and distributing the processing load between them. The main contribution of this proposed approach is the alternative conceptualization of collaborative computing: instead of building a system that allows collaboration between edge nodes and the cloud, we explore the prospect of collaboration between edge nodes, minimizing the involvement of the cloud resources even further.
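The placement decision described above can be sketched as a three-way choice: run on the local edge NN, split the work across peer edge nodes, or fall back to the cloud only when the peers together cannot absorb the load. The capacity model and names below are invented placeholders rather than the authors' design.

```python
# Hedged sketch of the local / peer-split / cloud-fallback offloading decision.
def place_inference(frame_load: float, local_capacity: float,
                    peer_capacities: dict) -> str:
    if frame_load <= local_capacity:
        return "local edge NN"
    spare_peers = {p: c for p, c in peer_capacities.items() if c > 0}
    if sum(spare_peers.values()) + local_capacity >= frame_load:
        # split the NN partitions across peers instead of going to the cloud
        return f"split across {sorted(spare_peers)}"
    return "cloud fallback"

if __name__ == "__main__":
    print(place_inference(3.0, 1.0, {"edge-b": 1.5, "edge-c": 1.0}))  # split
    print(place_inference(8.0, 1.0, {"edge-b": 1.5, "edge-c": 1.0}))  # cloud fallback
```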
{"title":"Collaborative Edge-Cloud and Edge-Edge Video Analytics","authors":"Samaa Gazzaz, Faisal Nawab","doi":"10.1145/3357223.3366024","DOIUrl":"https://doi.org/10.1145/3357223.3366024","url":null,"abstract":"According to YouTube statistics [1], more than 400 hours of content is uploaded to its platform every minute. At this rate, it is estimated that it would take more than 70 years of continuous watch time in order to view all content on YouTube, assuming no more content is uploaded. This raises great challenges when attempting to actively process and analyze video content. Real-time video processing is a critical element that brings forth numerous applications otherwise infeasible due to scalability constraints. Predictive models are commonly used, specifically Neural Networks (NNs), to accelerate processing time when analyzing realtime content. However, applying NNs is computationally expensive. Advanced hardware (e.g. graphics processing units or GPUs) and cloud infrastructure are usually utilized to meet the demand of processing applications. Nevertheless, recent work in the field of edge computing aims to develop systems that relieve the load on the cloud by delegating parts of the job to edge nodes. Such systems emphasize processing as much as possible within the edge node before delegating the load to the cloud in hopes of reducing the latency. In addition, processing content in the edge promotes the privacy and security of the data. One example is the work by Grulich et al. [2] where the edge node relieves some of the work load off the cloud by splitting, differentiating and compressing the NN used to analyze the content. Even though the collaboration between the edge node and the cloud expedites the processing time by relying on the edge node's capability, there is still room for improvement. Our proposal aims to utilize the edge nodes even further by allowing the nodes to collaborate among themselves as a para-cloud that minimizes the dependency on the primary processing cloud. We propose a collaborative system solution where a video uploaded on an edge node could be labeled and analyzed collaboratively without the need to utilize cloud resources. The proposed collaborative system is illustrated in Figure 1. The system consists of multiple edge nodes that acquire video content from different sources. Each node starts the analysis process via a specialized, smaller NN [3] utilizing the edge node's processing power. Whenever the load overwhelms the node or the node is unable to provide accurate analysis via its specialized NN, the node requests other edge nodes to collaborate on the analysis instead of delegating to the cloud resources. This way the high latency is avoided and other edge node processing power is utilized by splitting the NN among the different edge nodes and distributing the processing load between them. The main contribution of this proposed approach is the alternative conceptualization of collaborative computing: instead of building a system that allows collaboration between edge nodes and the cloud, we explore the prospective of collaboration between edge nodes, minimizing the involvement of the cloud resources even furth","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"58 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80151930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
Peregrine: Workload Optimization for Cloud Query Engines
Alekh Jindal, Hiren Patel, Abhishek Roy, S. Qiao, Zhicheng Yin, Rathijit Sen, Subru Krishnan
Database administrators (DBAs) were traditionally responsible for optimizing the on-premise database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of a DBA is completely missing. At the same time, optimizing query workloads is becoming increasingly important for reducing the total costs of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both the end users and the system developers: users have workload optimization as a major pain point while the system developers are now tasked with supporting a large base of cloud users. We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.
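As a rough illustration of the query-annotation idea, the sketch below keeps engine-agnostic hints keyed by a normalized query signature, so a query engine could look them up at compile time. The signature scheme, hint fields, and class names are assumptions made for illustration and are not Peregrine's actual format.

```python
# Illustrative annotation store: workload analysis learns hints offline, the
# query engine looks them up at compile time by query signature.
import hashlib

class AnnotationStore:
    def __init__(self):
        self._hints = {}

    @staticmethod
    def signature(query_text: str) -> str:
        # normalize trivially so recurring queries map to the same key
        return hashlib.sha256(" ".join(query_text.lower().split()).encode()).hexdigest()

    def learn(self, query_text: str, hint: dict):
        self._hints[self.signature(query_text)] = hint   # produced by workload analysis

    def lookup(self, query_text: str) -> dict:
        return self._hints.get(self.signature(query_text), {})

if __name__ == "__main__":
    store = AnnotationStore()
    q = "SELECT region, COUNT(*) FROM sales GROUP BY region"
    store.learn(q, {"estimated_rows": 42_000, "reuse_materialized_view": "sales_by_region"})
    print(store.lookup("select region, count(*) from sales group by region"))
```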
{"title":"Peregrine: Workload Optimization for Cloud Query Engines","authors":"Alekh Jindal, Hiren Patel, Abhishek Roy, S. Qiao, Zhicheng Yin, Rathijit Sen, Subru Krishnan","doi":"10.1145/3357223.3362726","DOIUrl":"https://doi.org/10.1145/3357223.3362726","url":null,"abstract":"Database administrators (DBAs) were traditionally responsible for optimizing the on-premise database workloads. However, with the rise of cloud data services, where cloud providers offer fully managed data processing capabilities, the role of a DBA is completely missing. At the same time, optimizing query workloads is becoming increasingly important for reducing the total costs of operation and making data processing economically viable in the cloud. This paper revisits workload optimization in the context of these emerging cloud-based data services. We observe that the missing DBA in these newer data services has affected both the end users and the system developers: users have workload optimization as a major pain point while the system developers are now tasked with supporting a large base of cloud users. We present Peregrine, a workload optimization platform for cloud query engines that we have been developing for the big data analytics infrastructure at Microsoft. Peregrine makes three major contributions: (i) a novel way of representing query workloads that is agnostic to the query engine and is general enough to describe a large variety of workloads, (ii) a categorization of the typical workload patterns, derived from production workloads at Microsoft, and the corresponding workload optimizations possible in each category, and (iii) a prescription for adding workload-awareness to a query engine, via the notion of query annotations that are served to the query engine at compile time. We discuss a case study of Peregrine using two optimizations over two query engines, namely Scope and Spark. Peregrine has helped cut the time to develop new workload optimization features from years to months, benefiting the research teams, the product teams, and the customers at Microsoft.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80975994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
C-LSM: Cooperative Log Structured Merge Trees
Natasha Mittal, Faisal Nawab
The basic structure of the LSM [3] tree considered here consists of four levels: L0 in memory, and L1 to L3 on disk. Compaction from L0 to L1 is done through tiering, and compaction in the rest of the tree is done through leveling. Cooperative-LSM (C-LSM) deconstructs the monolithic structure of LSM [3] trees to enhance their scalability by utilizing the resources of multiple machines in a more flexible way. The monolithic structure of the LSM tree lacks flexibility, and the only way to deal with an increased load is to re-partition the data and distribute it across nodes. C-LSM comprises three components: leader, compactor, and backup. The leader node receives write requests; it maintains levels L0 and L1 of the LSM tree and performs minor compactions. The compactor maintains the rest of the levels (L2 and L3) and is responsible for compacting them. The backup maintains a copy of the entire LSM tree for fault tolerance and read availability.

The advantages that C-LSM provides are two-fold:
• components can be placed on different machines, and
• each component can run as more than one instance.

Running more than one instance of each component enables various performance advantages:
• Increasing the number of leaders enables the system to digest data faster, because the performance of a single machine no longer limits the system.
• Increasing the number of compactors enables offloading compaction [1] to more nodes and thus reduces the impact of compaction on other functions.
• Increasing the number of backups increases read availability.

All of these advantages can also be achieved by re-partitioning the data and distributing the partitions across nodes, which most current LSM variants do. However, we hypothesize that partitioning is not feasible in all cases; for example, in a dynamic workload where access patterns are unpredictable, no clear partitioning is feasible. In this case, the developer either has to endure the overhead of re-partitioning the data all the time, or cannot utilize the system resources efficiently if no re-partitioning is done. C-LSM enables scaling (and down-scaling) with less overhead compared to re-partitioning; if a partition suddenly receives more requests, one can simply add a new component on another node. Each component has different characteristics in terms of how it affects the workload and I/O. By having the flexibility to break down the components, one can find ways to distribute them so as to increase overall efficiency. Having multiple instances of the three components leads to interesting challenges in ensuring that they work together without introducing inconsistencies. We are trying to solve this through careful design of how these components interact and how decisions are managed when failures or scaling events happen. Another interesting problem to solve is having multiple instances of C-LSM, each dedicated to one edge node or a cluster of edge nodes: for mobile or real-time data analytics applications, more and more data needs to be processed at the edge nodes [2] themselves, and a dedicated C-LSM would improve overall latency. Multiple components also bring drawbacks that need to be addressed; for example, having multiple compactors leads to cross-machine compaction and/or data redundancy, and having multiple leaders requires maintaining linearizable access.
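To make the leader/compactor/backup split tangible, the toy code below has a leader absorb writes into an in-memory L0, minor-compact it into an SSTable, hand that SSTable to a separate compactor object, and mirror it to a backup. Networking, the tiering and leveling policies, and consistency across multiple leaders are deliberately omitted; all names are illustrative only.

```python
# Toy decomposition in the spirit of C-LSM; not the authors' implementation.
class Compactor:
    def __init__(self):
        self.levels = {2: [], 3: []}          # L2/L3 live on this node

    def accept(self, sstable):
        self.levels[2].append(sstable)        # major compaction into L3 would go here

class Leader:
    def __init__(self, compactor, backup, memtable_limit=3):
        self.memtable, self.limit = {}, memtable_limit   # L0 held in memory
        self.compactor, self.backup = compactor, backup

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:             # minor compaction trigger
            sstable = sorted(self.memtable.items())
            self.compactor.accept(sstable)
            self.backup.append(sstable)                  # replica for reads / failover
            self.memtable = {}

if __name__ == "__main__":
    backup, compactor = [], Compactor()
    leader = Leader(compactor, backup)
    for i in range(7):
        leader.put(f"k{i}", i)
    print(len(compactor.levels[2]), "sstables at the compactor,", len(backup), "mirrored")
```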
{"title":"C-LSM: Cooperative Log Structured Merge Trees","authors":"Natasha Mittal, Faisal Nawab","doi":"10.1145/3357223.3365443","DOIUrl":"https://doi.org/10.1145/3357223.3365443","url":null,"abstract":"The basic structure of the LSM[3] tree consists of four levels (we are considering only 4 levels), L0 in memory, and L1 to L3 in the disk. Compaction in L0/L1 is done through tiering, and compaction in the rest of the tree is done through leveling. Cooperative-LSM (C-LSM) is implemented by deconstructing the monolithic structure of LSM[3] trees to enhance the scalability of LSM trees by utilizing the resources of multiple machines in a more flexible way. The monolithic structure of LSM[3] tree lacks flexibility, and the only way to deal with an increased load on is to re-partition the data and distribute it across nodes. C-LSM comprises of three components - leader, compactor, and backup. Leader node receives write requests. It maintains Levels L0 and L1 of the LSM tree and performs minor compactions. Compactor maintains the rest of the levels (L2 and L3) and is responsible for compacting them. Backup maintains a copy of the entire LSM tree for fault-tolerance and read availability. The advantages that C-LSM provides are two-fold: • one can place components on different machines, and • one can have more than one instance of each component Running more than one instance for each component can enable various performance advantages: • Increasing the number of Leaders enables to digest data faster because the performance of a single machine no longer limits the system. • Increasing the number of Compactors enables to offload compaction[1] to more nodes and thus reduce the impact of compaction on other functions. • Increasing the number of backups increases read availability. Although, all these advantages can be achieved by re-partitioning the data and distributing the partitions across nodes, which most current LSM variants do. However, we hypothesize that partitioning is not feasible for all cases. For example, a dynamic workload where access patterns are unpredictable and no clear partitioning is feasible. In this case, the developer either has to endure the overhead of re-partitioning the data all the time or not be able to utilize the system resources efficiently if no re-partitioning is done. C-LSM enables scaling (and down-scaling) with less overhead compared to re-partitioning; if a partition is suddenly getting more requests, one can simply add a new component on another node. Each one of the components has different characteristics in terms of how it affects the workload and I/O. By having the flexibility to break down the components, one can find ways to distribute them in a way to increase overall efficiency. Having multiple instances of the three components leads to interesting challenges in terms of how to ensure that they work together without leading to any inconsistencies. We are trying to solve this through careful design of how these components interact and how to manage the decisions when failures or scaling events happen. Another interesting problem to solve is having multiple instances of C-LSM, each dedicated to one edge node o","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... 
SoCC (Conference)","volume":"170 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79394621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0