
Latest publications: Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)

Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications
Francisco Romero, G. Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, N. Yadwadkar, R. Fonseca, C. Kozyrakis, R. Bianchini
Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short: it disregards the widely different characteristics of FaaS applications, does not scale the cache based on data access patterns, or requires changes to applications. To address these limitations, we present Faa$T, a transparent auto-scaling distributed cache for serverless applications. Each application gets its own cache. After a function executes and the application becomes inactive, the cache is unloaded from memory with the application. Upon reloading for the next invocation, Faa$T pre-warms the cache with objects likely to be accessed. In addition to traditional compute-based scaling, Faa$T scales based on working set and object sizes to manage cache space and I/O bandwidth. We motivate our design with a comprehensive study of data access patterns on Azure Functions. We implement Faa$T for Azure Functions, and show that Faa$T can improve performance by up to 92% (57% on average) for challenging applications, and reduce cost for most users compared to state-of-the-art caching systems, i.e. the cost of having to stand up additional serverful resources.
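The cache lifecycle the abstract describes (unload with the application, pre-warm on the next invocation, scale by working-set and object sizes) can be illustrated with a small sketch. This is a toy, single-node stand-in and not Faa$T's implementation: the `AppCache` class, the LRU eviction, the ceiling-division scaling rule, and the dict used in place of remote storage are all illustrative assumptions.

```python
# Toy sketch (not Faa$T's implementation): a per-application cache that is unloaded
# with the application, pre-warmed from a saved key list on the next invocation,
# and sized from the observed working set. All names and policies are illustrative.
from collections import OrderedDict

class AppCache:
    def __init__(self, capacity_bytes, backing_store):
        self.capacity = capacity_bytes      # assumed per-instance memory budget
        self.store = backing_store          # dict-like stand-in for remote storage
        self.data = OrderedDict()           # key -> bytes, kept in LRU order
        self.used = 0

    def get(self, key):
        if key in self.data:                # hit: refresh LRU position
            self.data.move_to_end(key)
            return self.data[key]
        value = self.store[key]             # miss: fetch from remote storage
        self._insert(key, value)
        return value

    def _insert(self, key, value):
        self.data[key] = value
        self.used += len(value)
        while self.used > self.capacity:    # evict LRU objects to fit the budget
            _, evicted = self.data.popitem(last=False)
            self.used -= len(evicted)

    def unload(self):
        """Application went idle: drop the data, keep a pre-warm hint."""
        hint = list(self.data.keys())
        self.data.clear()
        self.used = 0
        return hint

    def prewarm(self, keys):
        """Next invocation: fetch objects that are likely to be accessed again."""
        for key in keys:
            if key in self.store and key not in self.data:
                self._insert(key, self.store[key])

    def suggested_instances(self, working_set_bytes):
        # Toy size-based scaling rule: enough cache instances to hold the working set.
        return max(1, -(-working_set_bytes // self.capacity))
```

A real deployment would also distribute objects across cache instances and scale on I/O bandwidth, which this sketch omits.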
{"title":"Faa$T: A Transparent Auto-Scaling Cache for Serverless Applications","authors":"Francisco Romero, G. Chaudhry, Íñigo Goiri, Pragna Gopa, Paul Batum, N. Yadwadkar, R. Fonseca, C. Kozyrakis, R. Bianchini","doi":"10.1145/3472883.3486974","DOIUrl":"https://doi.org/10.1145/3472883.3486974","url":null,"abstract":"Function-as-a-Service (FaaS) has become an increasingly popular way for users to deploy their applications without the burden of managing the underlying infrastructure. However, existing FaaS platforms rely on remote storage to maintain state, limiting the set of applications that can be run efficiently. Recent caching work for FaaS platforms has tried to address this problem, but has fallen short: it disregards the widely different characteristics of FaaS applications, does not scale the cache based on data access patterns, or requires changes to applications. To address these limitations, we present Faa$T, a transparent auto-scaling distributed cache for serverless applications. Each application gets its own cache. After a function executes and the application becomes inactive, the cache is unloaded from memory with the application. Upon reloading for the next invocation, Faa$T pre-warms the cache with objects likely to be accessed. In addition to traditional compute-based scaling, Faa$T scales based on working set and object sizes to manage cache space and I/O bandwidth. We motivate our design with a comprehensive study of data access patterns on Azure Functions. We implement Faa$T for Azure Functions, and show that Faa$T can improve performance by up to 92% (57% on average) for challenging applications, and reduce cost for most users compared to state-of-the-art caching systems, i.e. the cost of having to stand up additional serverful resources.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"171 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76005798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 56
Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines
Francisco Romero, Mark Zhao, N. Yadwadkar, C. Kozyrakis
The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, "Is this intersection congested?" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation (e.g., sampling rate, batch size, or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, (b) the optimal configuration depends on users' desired latency and cost targets, and (c) input video contents may exercise different paths in the DAG and produce a variable amount of intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average.
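As a rough illustration of the per-invocation planning idea (a latency target per operation, then the cheapest hardware configuration that meets it), here is a minimal sketch. The candidate table, the proportional budget split, and the fallback rule are assumptions made up for the example; Llama's actual optimizer is more sophisticated and input-dependent.

```python
# Minimal sketch of the idea in the abstract: split an end-to-end latency budget
# across pipeline operations, then pick the cheapest (hardware, knob) configuration
# that meets each per-operation target. The proportional split and the candidate
# table below are illustrative assumptions, not Llama's actual algorithm.

# Candidate configurations per operation: (name, expected latency in s, cost in $).
CANDIDATES = {
    "decode": [("cpu-1core", 0.80, 0.0010), ("cpu-4core", 0.30, 0.0032)],
    "detect": [("cpu-4core", 2.00, 0.0210), ("gpu-t4", 0.25, 0.0150)],
    "count":  [("cpu-1core", 0.05, 0.0001)],
}

def plan(pipeline, end_to_end_target_s):
    # Assumption: distribute the budget proportionally to each op's fastest latency.
    fastest = {op: min(lat for _, lat, _ in CANDIDATES[op]) for op in pipeline}
    total = sum(fastest.values())
    chosen, total_cost = {}, 0.0
    for op in pipeline:
        per_op_target = end_to_end_target_s * fastest[op] / total
        feasible = [c for c in CANDIDATES[op] if c[1] <= per_op_target]
        # Cheapest feasible config; fall back to the fastest one if none fits.
        choice = (min(feasible, key=lambda c: c[2]) if feasible
                  else min(CANDIDATES[op], key=lambda c: c[1]))
        chosen[op] = choice
        total_cost += choice[2]
    return chosen, total_cost

if __name__ == "__main__":
    print(plan(["decode", "detect", "count"], end_to_end_target_s=1.5))
```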
{"title":"Llama: A Heterogeneous & Serverless Framework for Auto-Tuning Video Analytics Pipelines","authors":"Francisco Romero, Mark Zhao, N. Yadwadkar, C. Kozyrakis","doi":"10.1145/3472883.3486972","DOIUrl":"https://doi.org/10.1145/3472883.3486972","url":null,"abstract":"The proliferation of camera-enabled devices and large video repositories has led to a diverse set of video analytics applications. These applications rely on video pipelines, represented as DAGs of operations, to transform videos, process extracted metadata, and answer questions like, \"Is this intersection congested?\" The latency and resource efficiency of pipelines can be optimized using configurable knobs for each operation (e.g., sampling rate, batch size, or type of hardware used). However, determining efficient configurations is challenging because (a) the configuration search space is exponentially large, and (b) the optimal configuration depends on users' desired latency and cost targets, (c) input video contents may exercise different paths in the DAG and produce a variable amount intermediate results. Existing video analytics and processing systems leave it to the users to manually configure operations and select hardware resources. We present Llama: a heterogeneous and serverless framework for auto-tuning video pipelines. Given an end-to-end latency target, Llama optimizes for cost efficiency by (a) calculating a latency target for each operation invocation, and (b) dynamically running a cost-based optimizer to assign configurations across heterogeneous hardware that best meet the calculated per-invocation latency target. This makes the problem of auto-tuning large video pipelines tractable and allows us to handle input-dependent behavior, conditional branches in the DAG, and execution variability. We describe the algorithms in Llama and evaluate it on a cloud platform using serverless CPU and GPU resources. We show that compared to state-of-the-art cluster and serverless video analytics and processing systems, Llama achieves 7.8x lower latency and 16x cost reduction on average.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"113 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76727494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
SoCC '21: ACM Symposium on Cloud Computing, Seattle, WA, USA, November 1 - 4, 2021
{"title":"SoCC '21: ACM Symposium on Cloud Computing, Seattle, WA, USA, November 1 - 4, 2021","authors":"","doi":"10.1145/3472883","DOIUrl":"https://doi.org/10.1145/3472883","url":null,"abstract":"","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81829390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020
{"title":"SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19-21, 2020","authors":"","doi":"10.1145/3419111","DOIUrl":"https://doi.org/10.1145/3419111","url":null,"abstract":"","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79136188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Grasper
Hongzhi Chen, Changji Li, Juncheng Fang, Chenghuan Huang, James Cheng, Jian Zhang, Yifan Hou, Xiao Yan
The property graph (PG) model is one of the most general graph data models and has been widely adopted in many graph analytics and processing systems. However, existing systems suffer from poor performance in terms of both latency and throughput for processing online analytical workloads on PGs due to their design defects such as expensive interactions with external databases, low parallelism, and high network overheads. In this paper, we propose Grasper, a high performance distributed system for OLAP on property graphs. Grasper adopts RDMA-aware system designs to reduce the network communication cost. We propose a novel query execution model, called Expert Model, which supports adaptive parallelism control at the fine-grained query operation level and allows tailored optimizations for different categories of query operators, thus achieving high parallelism and good load balancing. Experimental results show that Grasper achieves low latency and high throughput on a broad range of online analytical workloads.
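A rough Python sketch of the "expert per operator category" idea: each category of query operator gets its own worker pool whose parallelism can be tuned independently. The category names, pool sizes, and dispatch loop are assumptions for illustration only; Grasper itself is a distributed, RDMA-aware system, not a thread pool.

```python
# Illustrative sketch: one "expert" (worker pool) per operator category, each with
# independently tunable parallelism. Categories and pool sizes are assumptions.
from concurrent.futures import ThreadPoolExecutor

class Expert:
    def __init__(self, name, parallelism):
        self.name = name
        self.pool = ThreadPoolExecutor(max_workers=parallelism)

    def submit(self, operator_fn, *args):
        return self.pool.submit(operator_fn, *args)

# Assumed operator categories with independently tuned parallelism.
EXPERTS = {
    "scan":     Expert("scan", parallelism=8),      # data-parallel, scales with cores
    "traverse": Expert("traverse", parallelism=4),
    "filter":   Expert("filter", parallelism=2),     # cheap, little parallelism needed
}

def execute(plan):
    """plan: list of (category, fn, args); each step runs on its category's expert."""
    result = None
    for category, fn, args in plan:
        result = EXPERTS[category].submit(fn, result, *args).result()
    return result

if __name__ == "__main__":
    plan = [
        ("scan",     lambda _prev, n: list(range(n)), (10,)),
        ("filter",   lambda prev, k: [v for v in prev if v % k == 0], (2,)),
        ("traverse", lambda prev: sum(prev), ()),
    ]
    print(execute(plan))  # 0 + 2 + 4 + 6 + 8 = 20
```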
{"title":"Grasper","authors":"Hongzhi Chen, Changji Li, Juncheng Fang, Chenghuan Huang, James Cheng, Jian Zhang, Yifan Hou, Xiao Yan","doi":"10.1145/3357223.3362715","DOIUrl":"https://doi.org/10.1145/3357223.3362715","url":null,"abstract":"The property graph (PG) model is one of the most general graph data model and has been widely adopted in many graph analytics and processing systems. However, existing systems suffer from poor performance in terms of both latency and throughput for processing online analytical workloads on PGs due to their design defects such as expensive interactions with external databases, low parallelism, and high network overheads. In this paper, we propose Grasper, a high performance distributed system for OLAP on property graphs. Grasper adopts RDMA-aware system designs to reduce the network communication cost. We propose a novel query execution model, called Expert Model, which supports adaptive parallelism control at the fine-grained query operation level and allows tailored optimizations for different categories of query operators, thus achieving high parallelism and good load balancing. Experimental results show that Grasper achieves low latency and high throughput on a broad range of online analytical workloads.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74071421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Proceedings of the ACM Symposium on Cloud Computing
{"title":"Proceedings of the ACM Symposium on Cloud Computing","authors":"","doi":"10.1145/3357223","DOIUrl":"https://doi.org/10.1145/3357223","url":null,"abstract":"","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85375924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications
Wei Chen, Aidi Pi, Shaoqi Wang, Xiaobo Zhou
Data-intensive applications often suffer from significant memory pressure, resulting in excessive garbage collection (GC) and out-of-memory (OOM) errors, harming system performance and reliability. In this paper, we demonstrate how lightweight virtualization via OS containers opens up opportunities to address memory pressure and realize memory elasticity: 1) tasks running in a container can be set to a large heap size to avoid OutOfMemory (OOM) errors, and 2) tasks that are under memory pressure and incur significant swapping activities can be temporarily "suspended" by depriving resources from the hosting containers, and be "resumed" when resources are available. We propose and develop Pufferfish, an elastic memory manager that leverages containers to flexibly allocate memory for tasks. Memory elasticity achieved by Pufferfish can be exploited by a cluster scheduler to improve cluster utilization and task parallelism. We implement Pufferfish on the cluster scheduler Apache Yarn. Experiments with Spark and MapReduce on real-world traces show Pufferfish is able to avoid OOM errors, improve cluster memory utilization by 2.7x and the median job runtime by 5.5x compared to a memory over-provisioning solution.
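The "suspend under memory pressure, resume when resources free up" mechanism can be approximated with standard container tooling. The sketch below is a simplified stand-in, not Pufferfish (which integrates with the YARN scheduler): the swap-rate threshold, polling interval, and the use of `docker pause`/`docker unpause` as the suspend/resume primitive are assumptions for illustration.

```python
# Toy monitor in the spirit of the abstract: watch system swap activity and, when it
# crosses a threshold, "suspend" a memory-hungry container (docker pause) until the
# pressure subsides, then "resume" it (docker unpause). Thresholds and the victim
# choice are illustrative assumptions, not Pufferfish's policy. Linux only.
import subprocess
import time

def swapped_pages():
    """Total pages swapped in + out since boot, read from /proc/vmstat."""
    counts = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            counts[key] = int(value)
    return counts.get("pswpin", 0) + counts.get("pswpout", 0)

def monitor(container, pressure_pages_per_s=1000, interval_s=5):
    """Runs forever: pause the container under heavy swapping, unpause otherwise."""
    paused = False
    last = swapped_pages()
    while True:
        time.sleep(interval_s)
        now = swapped_pages()
        rate = (now - last) / interval_s
        last = now
        if rate > pressure_pages_per_s and not paused:
            subprocess.run(["docker", "pause", container], check=True)
            paused = True
        elif rate <= pressure_pages_per_s and paused:
            subprocess.run(["docker", "unpause", container], check=True)
            paused = False
```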
{"title":"Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications","authors":"Wei Chen, Aidi Pi, Shaoqi Wang, Xiaobo Zhou","doi":"10.1145/3357223.3362730","DOIUrl":"https://doi.org/10.1145/3357223.3362730","url":null,"abstract":"Data-intensive applications often suffer from significant memory pressure, resulting in excessive garbage collection (GC) and out-of-memory (OOM) errors, harming system performance and reliability. In this paper, we demonstrate how lightweight virtualization via OS containers opens up opportunities to address memory pressure and realize memory elasticity: 1) tasks running in a container can be set to a large heap size to avoid OutOfMemory (OOM) errors, and 2) tasks that are under memory pressure and incur significant swapping activities can be temporarily \"suspended\" by depriving resources from the hosting containers, and be \"resumed\" when resources are available. We propose and develop Pufferfish, an elastic memory manager, that leverages containers to flexibly allocate memory for tasks. Memory elasticity achieved by Pufferfish can be exploited by a cluster scheduler to improve cluster utilization and task parallelism. We implement Pufferfish on the cluster scheduler Apache Yarn. Experiments with Spark and MapReduce on real-world traces show Pufferfish is able to avoid OOM errors, improve cluster memory utilization by 2.7x and the median job runtime by 5.5x compared to a memory over-provisioning solution.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"67 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85571972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Software Data Planes: You Can't Always Spin to Win
Hossein Golestani, Amirhossein Mirhosseini, T. Wenisch
Today's datacenters demand high-performance, energy-efficient software data planes, which are widely used in many areas including fast network packet processing, network function virtualization, high-speed data transfer in storage systems, and I/O virtualization. Modern software data planes bypass OS I/O stacks and rely on cores spinning on user-level queues as a fast notification mechanism. Whereas spin-polling can improve latency and throughput, it entails significant shortcomings, especially when scaling to large numbers of cores/queues. In this paper, we pinpoint and quantify challenges of spin-polling-based software data planes using Intel's Data Plane Development Kit (DPDK) as a representative infrastructure. We characterize four scalability issues of software data planes: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues; (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced; (3) Operation rate limits (transactions per second) as well as a Polling Tax (the overhead of polling, which is considerable even when operating at saturation throughput) result in poor core scalability; and (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, synchronization overheads limit their potential benefits. We identify root causes of these issues and discuss solution directions to improve hardware and software abstractions for better performance, efficiency, and scalability in software data planes.
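Issue (1), the useless polling work at low load, is easy to see with a tiny simulation: a core round-robin polls a set of queues and we count the fraction of polls that find nothing. The Bernoulli arrival model, queue count, and load levels below are illustrative assumptions, not measurements from DPDK or the paper.

```python
# Tiny simulation of issue (1) from the abstract: the lower the offered load, the
# larger the fraction of poll operations that find empty queues ("useless" work).
import random

def polling_tax(num_queues, arrival_prob, polls=100_000):
    """Fraction of polls that find an empty queue under a Bernoulli arrival model."""
    queues = [0] * num_queues
    empty = 0
    for i in range(polls):
        q = i % num_queues                  # round-robin spin-polling over the queues
        if random.random() < arrival_prob:  # an item may arrive on the polled queue
            queues[q] += 1
        if queues[q] > 0:
            queues[q] -= 1                  # "process" one item
        else:
            empty += 1                      # empty poll: pure overhead
    return empty / polls

if __name__ == "__main__":
    for load in (0.9, 0.5, 0.1, 0.01):
        print(f"arrival_prob={load}: empty-poll fraction = {polling_tax(16, load):.2f}")
```

The cache-capacity and synchronization effects described in issues (2) and (4) depend on real hardware and are not captured by this toy model.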
{"title":"Software Data Planes: You Can't Always Spin to Win","authors":"Hossein Golestani, Amirhossein Mirhosseini, T. Wenisch","doi":"10.1145/3357223.3362737","DOIUrl":"https://doi.org/10.1145/3357223.3362737","url":null,"abstract":"Today's datacenters demand high-performance, energy-efficient software data planes, which are widely used in many areas including fast network packet processing, network function virtualization, high-speed data transfer in storage systems, and I/O virtualization. Modern software data planes bypass OS I/O stacks and rely on cores spinning on user-level queues as a fast notification mechanism. Whereas spin-polling can improve latency and throughput, it entails significant shortcomings, especially when scaling to large numbers of cores/queues. In this paper, we pinpoint and quantify challenges of spin-polling--based software data planes using Intel's Data Plane Development Kit (DPDK) as a representative infrastructure. We characterize four scalability issues of software data planes: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues; (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced; (3) Operation rate limits (transactions per second) as well as a Polling Tax (the overhead of polling, which is considerable even when operating at saturation throughput) result in poor core scalability. (4) Whereas shared queues can mitigate load imbalance and head-of-line-blocking, synchronization overheads limit their potential benefits. We identify root causes of these issues and discuss solution directions to improve hardware and software abstractions for better performance, efficiency, and scalability in software data planes.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81184363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications
E. Ates, Lily Sturmann, Mert Toslali, O. Krieger, Richard Megginson, A. Coskun, Raja R. Sambasivan
Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
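The key insight (requests on the same workflow path should perform similarly, so high variance within such a group points at where to instrument next) can be sketched in a few lines. The trace format and the coefficient-of-variation ranking below are assumptions for illustration, not Pythia's actual search algorithm.

```python
# Simplified sketch of the insight in the abstract: group traced requests by workflow
# path and rank paths by how much their latencies vary; high-variance groups are
# candidates for enabling additional instrumentation.
from collections import defaultdict
from statistics import mean, pstdev

# Assumed trace format: (workflow_path, end_to_end_latency_ms).
TRACES = [
    (("frontend", "auth", "db.read"), 12.0),
    (("frontend", "auth", "db.read"), 11.5),
    (("frontend", "auth", "db.read"), 58.0),   # same path, very different latency
    (("frontend", "cache.get"), 2.1),
    (("frontend", "cache.get"), 2.3),
]

def rank_by_variation(traces):
    groups = defaultdict(list)
    for path, latency in traces:
        groups[path].append(latency)
    ranked = []
    for path, latencies in groups.items():
        if len(latencies) >= 2:
            # Coefficient of variation: high values flag requests that "should perform
            # similarly but don't" -- a candidate site for more instrumentation.
            ranked.append((pstdev(latencies) / mean(latencies), path))
    return sorted(ranked, reverse=True)

if __name__ == "__main__":
    for cv, path in rank_by_variation(TRACES):
        print(f"cv={cv:.2f}  path={' -> '.join(path)}")
```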
{"title":"An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications","authors":"E. Ates, Lily Sturmann, Mert Toslali, O. Krieger, Richard Megginson, A. Coskun, Raja R. Sambasivan","doi":"10.1145/3357223.3362704","DOIUrl":"https://doi.org/10.1145/3357223.3362704","url":null,"abstract":"Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application's nodes (i.e., records their workflows). It uses the key insight that localizing the sources high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"107 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80797998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Cirrus
João Carreira, P. Fonseca, A. Tumanov, Andrew Zhang, R. Katz
Machine learning (ML) workflows are extremely complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that are repeatedly executed by users but have heterogeneous computational requirements. This complexity makes it challenging for ML users to correctly provision and manage resources and, in practice, constitutes a significant burden that frequently causes over-provisioning and impairs user productivity. Serverless computing is a compelling model to address the resource management problem, in general, but there are numerous challenges to adopt it for existing ML frameworks due to significant restrictions on local resources. This work proposes Cirrus---an ML framework that automates the end-to-end management of datacenter resources for ML workflows by efficiently taking advantage of serverless infrastructures. Cirrus combines the simplicity of the serverless interface and the scalability of the serverless infrastructure (AWS Lambdas and S3) to minimize user effort. We show that a design specialized for both serverless computation and iterative ML training is needed for robust and efficient ML training on serverless infrastructure. Our evaluation shows that Cirrus outperforms frameworks specialized along a single dimension: Cirrus is 100x faster than a general purpose serverless system [36] and 3.75x faster than specialized ML frameworks for traditional infrastructures [49].
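As a rough illustration of the serverless-training pattern the abstract describes (stateless function invocations that read the current model from remote storage, compute a gradient on their data shard, and write an update back), here is a minimal sketch. The in-memory dict standing in for S3, the synchronous driver loop, and plain least-squares SGD are all assumptions made for the example; this is not Cirrus's architecture.

```python
# Hand-wavy sketch of serverless iterative training: stateless "function invocations"
# GET the model from a shared store, compute a gradient on their shard, and PUT an
# update back. A dict stands in for S3; the driver invokes workers synchronously.
import numpy as np

STORE = {"w": np.zeros(3)}   # shared object store stand-in (e.g., an S3 key "w")

def worker(shard_x, shard_y, lr=0.1):
    """One stateless invocation: fetch model, compute gradient, push update."""
    w = STORE["w"].copy()                                  # GET current model
    grad = shard_x.T @ (shard_x @ w - shard_y) / len(shard_y)  # least-squares gradient
    STORE["w"] = w - lr * grad                             # PUT updated model

def train(x, y, workers=4, epochs=50):
    shards_x = np.array_split(x, workers)
    shards_y = np.array_split(y, workers)
    for _ in range(epochs):
        for sx, sy in zip(shards_x, shards_y):             # driver "invokes" workers
            worker(sx, sy)
    return STORE["w"]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    print(train(x, x @ true_w))   # should approach true_w
```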
{"title":"Cirrus","authors":"João Carreira, P. Fonseca, A. Tumanov, Andrew Zhang, R. Katz","doi":"10.1145/3357223.3362711","DOIUrl":"https://doi.org/10.1145/3357223.3362711","url":null,"abstract":"Machine learning (ML) workflows are extremely complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that are repeatedly executed by users but have heterogeneous computational requirements. This complexity makes it challenging for ML users to correctly provision and manage resources and, in practice, constitutes a significant burden that frequently causes over-provisioning and impairs user productivity. Serverless computing is a compelling model to address the resource management problem, in general, but there are numerous challenges to adopt it for existing ML frameworks due to significant restrictions on local resources. This work proposes Cirrus---an ML framework that automates the end-to-end management of datacenter resources for ML workflows by efficiently taking advantage of serverless infrastructures. Cirrus combines the simplicity of the serverless interface and the scalability of the serverless infrastructure (AWS Lambdas and S3) to minimize user effort. We show a design specialized for both serverless computation and iterative ML training is needed for robust and efficient ML training on serverless infrastructure. Our evaluation shows that Cirrus outperforms frameworks specialized along a single dimension: Cirrus is 100x faster than a general purpose serverless system [36] and 3.75x faster than specialized ML frameworks for traditional infrastructures [49].","PeriodicalId":91949,"journal":{"name":"Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference)","volume":"99 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78117532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35