Proceedings of the ... ACM Symposium on Cloud Computing [electronic resource] : SOCC ... ... SoCC (Conference): latest articles

HP-Mapper: A High Performance Storage Driver for Docker Containers
Fan Guo, Yongkun Li, Min Lv, Yinlong Xu, John C.S. Lui
Docker containers are widely deployed to provide lightweight virtualization, and they have many desirable features such as ease of deployment and near bare-metal performance. However, both the performance and cache efficiency of containers are still limited by their storage drivers, owing to coarse-grained copy-on-write operations and the large amount of redundancy in both I/O requests and the page cache. To improve the I/O performance and cache efficiency of containers, we develop HP-Mapper, a high performance storage driver for Docker containers. HP-Mapper provides a two-level mapping strategy to support fine-grained copy-on-write with low overhead, and an efficient interception method to reduce redundant I/Os. Furthermore, it uses a novel cache management mechanism to reduce duplicate cached data. Experimental results with our prototype system show that HP-Mapper significantly reduces copy-on-write latency thanks to its finer-grained copy-on-write scheme. Moreover, HP-Mapper reduces cache usage by 65.4% on average by eliminating duplicated data. As a result, HP-Mapper improves the throughput of real-world workloads by up to 39.4% and improves the startup speed of containers by 2.0x.
DOI: https://doi.org/10.1145/3357223.3362718 (published 2019-11-20)
Citations: 7
MME-FaaS Cloud-Native Control for Mobile Networks
Sonika Jindal, R. Ricci
The control plane for mobile wireless (e.g., cellular) networks faces challenges with respect to scaling, robustness, and handling of bursty traffic. In this paper, we take a cloud-native approach to building a mobile control plane, employing a design that maps transitions of device state to serverless functions. Using a prototype of the LTE/EPC Mobility Management Entity (MME), we demonstrate how to architect a mobile control plane using serverless computing primitives. We demonstrate the practicality of this approach, which differs significantly from designs based on traditional telecom infrastructure.
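The idea of mapping device-state transitions to serverless functions can be illustrated with a small sketch. This is not the authors' implementation: the state names, event names, handler functions, and the `dispatch` helper are all hypothetical simplifications chosen for illustration. Each handler is stateless and receives the device context as input, which is the property that would let it run as an independent function invocation.

```python
from typing import Callable, Dict, Tuple

# Hypothetical device context: a plain dict passed to every handler.
Context = Dict[str, str]
Handler = Callable[[Context], Context]


def handle_attach(ctx: Context) -> Context:
    # Stateless handler: returns a new context and never mutates its input.
    return {**ctx, "state": "REGISTERED"}


def handle_detach(ctx: Context) -> Context:
    return {**ctx, "state": "DEREGISTERED"}


# (current state, event) -> function to invoke. These names are assumptions.
TRANSITIONS: Dict[Tuple[str, str], Handler] = {
    ("DEREGISTERED", "attach_request"): handle_attach,
    ("REGISTERED", "detach_request"): handle_detach,
}


def dispatch(ctx: Context, event: str) -> Context:
    """Route an event to the function registered for the current state."""
    handler = TRANSITIONS.get((ctx["state"], event))
    if handler is None:
        # Events that are not valid in the current state leave it unchanged.
        return ctx
    return handler(ctx)


device = {"imsi": "001010000000001", "state": "DEREGISTERED"}
device = dispatch(device, "attach_request")
print(device["state"])
```

Because the context travels with each call, any number of handler instances could in principle serve a burst of events, provided the context is stored externally between invocations.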
DOI: https://doi.org/10.1145/3357223.3362722 (published 2019-11-20)
Citations: 3
Reverb
R. Netravali, James W. Mickens
Bugs are common in web pages. Unfortunately, traditional debugging primitives like breakpoints are crude tools for understanding the asynchronous, wide-area data flows that bind client-side JavaScript code and server-side application logic. In this paper, we describe Reverb, a powerful new debugger that makes data flows explicit and queryable. Reverb provides three novel features. First, Reverb tracks precise value provenance, allowing a developer to quickly identify the reads and writes to JavaScript state that affected a particular variable's value. Second, Reverb enables speculative bug fix analysis. A developer can replay a program to a certain point, change code or data in the program, and then resume the replay; Reverb uses the remaining log of nondeterministic events to influence the post-edit replay, allowing the developer to investigate whether the hypothesized bug fix would have helped the original execution run. Third, Reverb supports wide-area debugging for applications whose server-side components use event-driven architectures. By tracking the data flows between clients and servers, Reverb enables speculative replaying of the distributed application.
DOI: https://doi.org/10.1145/3357223.3362733 (published 2019-11-20)
Citations: 1
DCUDA: Dynamic GPU Scheduling with Live Migration Support
Fan Guo, Yongkun Li, John C.S. Lui, Yinlong Xu
In clouds and data centers, GPU servers that consist of multiple GPUs are widely deployed. Current state-of-the-art GPU scheduling algorithms are "static" in assigning applications to different GPUs. These algorithms usually ignore the dynamics of GPU utilization and are often inaccurate in estimating resource demand before assigning and running applications, so there is a large opportunity to further balance load and improve GPU utilization. Based on CUDA (Compute Unified Device Architecture), we develop a runtime system called DCUDA, which supports "dynamic" scheduling of running applications between multiple GPUs. In particular, DCUDA provides a real-time and lightweight method to accurately monitor the resource demand of applications and GPU utilization. Furthermore, it provides a universal migration facility to migrate "running applications" between GPUs with negligible overhead. More importantly, DCUDA transparently supports all CUDA applications without changing their source code. Experiments with our prototype system show that DCUDA reduces the overloaded time of GPUs by 78.3% on average. As a result, for different workloads consisting of the wide range of applications we studied, DCUDA reduces the average execution time of applications by up to 42.1%. Furthermore, DCUDA also reduces energy use by 13.3% in the light-load scenario.
DOI: https://doi.org/10.1145/3357223.3362714 (published 2019-11-20)
Citations: 12
Sifter
P. Las-Casas, Giorgi Papakerashvili, Vaastav Anand, Jonathan Mace
Distributed tracing is a core component of cloud and datacenter systems, and provides visibility into their end-to-end runtime behavior. To reduce computational and storage overheads, most tracing frameworks do not keep all traces, but sample them uniformly at random. While effective at reducing overheads, uniform random sampling inevitably captures redundant, common-case execution traces, which are less useful for analysis and troubleshooting tasks. In this work, we present Sifter, a general-purpose framework for biased trace sampling. Sifter captures qualitatively more diverse traces by weighting sampling decisions towards edge-case code paths, infrequent request types, and anomalous events. Sifter does so by using the incoming stream of traces to build an unbiased low-dimensional model that approximates the system's common-case behavior. Sifter then biases sampling decisions towards traces that are poorly captured by this model. We have implemented Sifter, integrated it with several open-source tracing systems, and evaluated it with traces from a range of open-source and production distributed systems. Our evaluation shows that Sifter effectively biases towards anomalous and outlier executions, is robust to noisy and heterogeneous traces, is efficient and scalable, and adapts to changes in workloads over time.
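The contrast between uniform and biased sampling can be sketched in a few lines. The sketch below deliberately swaps in a much simpler technique than the one the abstract describes: instead of a learned low-dimensional model, it keeps frequency counts of a hypothetical per-trace "signature" string and samples rare signatures with higher probability. The class name `BiasedSampler`, the `base_rate` parameter, and the signature strings are all assumptions made for illustration.

```python
import random
from collections import Counter


class BiasedSampler:
    """Toy biased sampler: the rarer a trace signature, the likelier it is kept.

    A frequency table stands in for the model of common-case behavior;
    this is an illustrative substitute, not Sifter's actual model.
    """

    def __init__(self, base_rate: float = 0.1, seed: int = 0) -> None:
        self.counts: Counter = Counter()
        self.total = 0
        self.base_rate = base_rate
        self.rng = random.Random(seed)

    def sample_probability(self, signature: str) -> float:
        # Nothing observed yet, or a never-seen signature: always keep it.
        if self.total == 0 or self.counts[signature] == 0:
            return 1.0
        freq = self.counts[signature] / self.total
        # Common signatures are throttled so that each one contributes
        # roughly at most base_rate of the incoming stream.
        return min(1.0, self.base_rate / freq)

    def observe(self, signature: str) -> bool:
        """Decide whether to keep this trace, then update the frequency table."""
        keep = self.rng.random() < self.sample_probability(signature)
        self.counts[signature] += 1
        self.total += 1
        return keep


sampler = BiasedSampler(base_rate=0.1, seed=1)
for _ in range(1000):
    sampler.observe("GET /home -> cache hit")
print(sampler.sample_probability("GET /home -> cache hit"))
print(sampler.sample_probability("GET /home -> timeout, retry"))
```

Under uniform sampling both signatures would be kept with the same probability; here the never-seen signature is always kept while the dominant one is throttled.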
DOI: https://doi.org/10.1145/3357223.3362736 (published 2019-11-20)
Citations: 28
Accordia
Yang Liu, Huanle Xu, W. Lau
DOI: https://doi.org/10.1145/3357223.3365441 (published 2019-11-20)
Citations: 1
HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline
Richard Liaw, Romil Bhardwaj, Lisa Dunlap, Yitian Zou, Joseph Gonzalez, I. Stoica, Alexey Tumanov
Prior research in resource scheduling for machine learning training workloads has largely focused on minimizing job completion times. Commonly, these model training workloads collectively search over a large number of parameter values that control the learning process in a hyperparameter search. It is preferable to identify and maximally provision the best-performing hyperparameter configuration (trial) to achieve the highest accuracy result as soon as possible. To optimally trade off evaluating multiple configurations against training the most promising ones by a fixed deadline, we design and build HyperSched, a dynamic application-level resource scheduler that tracks, identifies, and preferentially allocates resources to the best-performing trials to maximize accuracy by the deadline. HyperSched leverages three properties of a hyperparameter search workload overlooked in prior work (trial disposability, progressively identifiable rankings among different configurations, and space-time constraints) to outperform standard hyperparameter search algorithms across a variety of benchmarks.
DOI: https://doi.org/10.1145/3357223.3362719 (published 2019-11-20)
Citations: 33
Towards a Library for Deterministic Failure Testing of Distributed Systems
Armin Balalaie, James A. Jones
Distributed systems are widespread today, and they are being used to serve millions of customers and process huge amounts of data. These systems run on commodity hardware and in an environment with many uncertainties, e.g., partial network failures and race conditions between nodes. Testing distributed systems requires new test libraries that take these uncertainties into account and can reproduce scenarios with specific timing constraints in a programming-language-agnostic way. To this end, we present Failify, a cross-platform, programming-language-agnostic, and deterministic failure testing library for distributed systems, which can be seamlessly integrated into different build systems. Failify, as an infrastructure, can also facilitate research in testing distributed systems in various ways. We experimented with six open-source distributed systems to show the compactness of Failify's deployment API. Our results indicate that, on average, the most reliable deployment architecture for these systems can be defined in fewer than 17 lines of code. We also experimented with HDFS to demonstrate potential scenarios where Failify's deterministic environmental manipulation and failure injection API can be effective.
DOI: https://doi.org/10.1145/3357223.3366026 (published 2019-11-20)
Citations: 0
PerfDebug
Jason Teoh, Muhammad Ali Gulzar, Guoqing Harry Xu, Miryung Kim
Performance is a key factor for big data applications, and much research has been devoted to optimizing these applications. While prior work can diagnose and correct data skew, the problem of computation skew (abnormally high computation costs for a small subset of input data) has been largely overlooked. Computation skew commonly occurs in real-world applications, and yet no tool is available for developers to pinpoint underlying causes. To enable a user to debug applications that exhibit computation skew, we develop a post-mortem performance debugging tool. PerfDebug automatically finds input records responsible for such abnormalities in a big data application by reasoning about deviations in performance metrics such as job execution time, garbage collection time, and serialization time. The key to PerfDebug's success is a data provenance-based technique that computes and propagates record-level computation latency to keep track of abnormally expensive records throughout the pipeline. Finally, the input records that have the largest latency contributions are presented to the user for bug fixing. We evaluate PerfDebug via in-depth case studies and observe that remediation such as removing the single most expensive record or a simple code rewrite can achieve up to 16X performance improvement.
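The notion of propagating record-level latency through a pipeline and ranking inputs by their contribution can be sketched as follows. This is a toy, single-machine illustration under stated assumptions, not PerfDebug itself: the helper `rank_by_latency` and both stage functions are hypothetical, and each stage reports a simulated cost rather than a measured time so that the example is deterministic.

```python
from typing import Any, Callable, List, Tuple

# A stage maps a value to (output value, cost of processing that value).
Stage = Callable[[Any], Tuple[Any, float]]


def tokenize(line: str) -> Tuple[List[str], float]:
    # Simulated cost: proportional to the number of characters scanned.
    return line.split(), float(len(line))


def pairwise_compare(words: List[str]) -> Tuple[int, float]:
    # Simulated cost: quadratic in the number of words, a typical
    # source of computation skew on a few unusually large records.
    return len(words), float(len(words) ** 2)


def rank_by_latency(records: List[Any], stages: List[Stage]) -> List[Tuple[Any, float]]:
    """Attribute the accumulated per-stage cost back to each input record."""
    totals = []
    for record in records:
        value, latency = record, 0.0
        for stage in stages:
            value, cost = stage(value)
            latency += cost  # carry the latency along with the record
        totals.append((record, latency))
    # Most expensive input records first.
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


inputs = ["a b", "a b c d e f g h", "x"]
for record, latency in rank_by_latency(inputs, [tokenize, pairwise_compare]):
    print(f"{latency:6.1f}  {record!r}")
```

In a real dataflow system the per-record costs would be measured and carried through shuffles and aggregations, which is where provenance tracking becomes necessary; the sketch covers only the one-to-one case.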
DOI: https://doi.org/10.1145/3357223.3362727 (published 2019-11-20)
Citations: 7
Neptune
Panagiotis Garefalakis, Konstantinos Karanasos, P. Pietzuch
DOI: https://doi.org/10.1145/3357223.3362724 (published 2019-11-20)
Citations: 14