
Proceedings of the 27th ACM Symposium on Operating Systems Principles: Latest Publications

CrashTuner
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359645
Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, Liang You
Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult to detect crash-recovery bugs since these bugs can only be exposed when nodes crash under special timing conditions. This paper presents CrashTuner, a novel fault-injection testing approach to combat crash-recovery bugs. The novelty of CrashTuner lies in how we identify fault-injection points (crash points) that are likely to expose errors. We observe that if a node crashes while accessing meta-info variables, i.e., variables referencing high-level system state information (e.g., an instance of a node or task), it often triggers crash-recovery bugs. Hence, we identify crash points by automatically inferring meta-info variables via a log-based static program analysis. Our approach is fully automatic; no manual specification is required. We have applied CrashTuner to five representative distributed systems: Hadoop2/Yarn, HBase, HDFS, ZooKeeper, and Cassandra. CrashTuner can finish testing each system within 17.39 hours, and reports 21 new bugs that have never been found before. All new bugs are confirmed by the original developers and 16 of them have already been fixed (14 with our patches). These new bugs can cause severe damage such as cluster outages or start-up failures.
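The crash-point selection idea lends itself to a short sketch: given a set of variables believed to hold high-level system state, every access to one of them becomes a candidate fault-injection point. The Python below is a toy illustration, not CrashTuner's implementation; the trace format and the meta-info variable set are invented for illustration.

```python
# Toy illustration of meta-info-guided crash-point selection.
# The events, variable names, and META_INFO_VARS set are invented;
# CrashTuner infers meta-info variables from log statements via
# static analysis and injects crashes in real cluster runs.

# Variables assumed (by the analysis) to reference high-level system
# state such as node or task instances.
META_INFO_VARS = {"task.attemptId", "node.containerId", "app.appId"}

# A recorded execution: (step, node, variable accessed).
trace = [
    (1, "ResourceManager", "config.retryInterval"),
    (2, "NodeManager-3",   "node.containerId"),
    (3, "AppMaster",       "task.attemptId"),
    (4, "NodeManager-3",   "heartbeat.counter"),
]

def crash_points(trace, meta_info_vars):
    """Keep only the points where a node touches a meta-info variable.

    Each selected point yields two fault-injection tests: crash the
    node just before the access and just after it.
    """
    points = []
    for step, node, var in trace:
        if var in meta_info_vars:
            points.append((step, node, var, "pre-access"))
            points.append((step, node, var, "post-access"))
    return points

for step, node, var, when in crash_points(trace, META_INFO_VARS):
    print(f"inject crash: node={node} at step {step} ({when} {var})")
```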
{"title":"CrashTuner","authors":"Jie Lu, Chen Liu, Lian Li, Xiaobing Feng, Feng Tan, Jun Yang, Liang You","doi":"10.1145/3341301.3359645","DOIUrl":"https://doi.org/10.1145/3341301.3359645","url":null,"abstract":"Crash-recovery bugs (bugs in crash-recovery-related mechanisms) are among the most severe bugs in cloud systems and can easily cause system failures. It is notoriously difficult to detect crash-recovery bugs since these bugs can only be exposed when nodes crash under special timing conditions. This paper presents CrashTuner, a novel fault-injection testing approach to combat crash-recovery bugs. The novelty of CrashTuner lies in how we identify fault-injection points (crash points) that are likely to expose errors. We observe that if a node crashes while accessing meta-info variables, i.e., variables referencing high-level system state information (e.g., an instance of node or task), it often triggers crash-recovery bugs. Hence, we identify crash points by automatically inferring meta-info variables via a log-based static program analysis. Our approach is automatic and no manual specification is required. We have applied CrashTuner to five representative distributed systems: Hadoop2/Yarn, HBase, HDFS, ZooKeeper, and Cassandra. CrashTuner can finish testing each system in 17.39 hours, and reports 21 new bugs that have never been found before. All new bugs are confirmed by the original developers and 16 of them have already been fixed (14 with our patches). These new bugs can cause severe damages such as cluster down or start-up failures.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122533843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
Verifying concurrent, crash-safe systems with Perennial
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359632
Tej Chajed, Joseph Tassarotti, M. Kaashoek, N. Zeldovich
This paper introduces Perennial, a framework for verifying concurrent, crash-safe systems. Perennial extends the Iris concurrency framework with three techniques to enable crash-safety reasoning: recovery leases, recovery helping, and versioned memory. To ease development and deployment of applications, Perennial provides Goose, a subset of Go and a translator from that subset to a model in Perennial with support for reasoning about Go threads, data structures, and file-system primitives. We implemented and verified a crash-safe, concurrent mail server using Perennial and Goose that achieves speedup on multiple cores. Both Perennial and Iris use the Coq proof assistant, and the mail server and the framework's proofs are machine checked.
Citations: 44
Fast and secure global payments with Stellar
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359636
Marta Lokhava, Giuliano Losa, David Mazières, Graydon Hoare, N. Barry, E. Gafni, Jonathan Jove, Rafał Malinowsky, Jed McCaleb
International payments are slow and expensive, in part because of multi-hop payment routing through heterogeneous banking systems. Stellar is a new global payment network that can directly transfer digital money anywhere in the world in seconds. The key innovation is a secure transaction mechanism across untrusted intermediaries, using a new Byzantine agreement protocol called SCP. With SCP, each institution specifies other institutions with which to remain in agreement; through the global interconnectedness of the financial system, the whole network then agrees on atomic transactions spanning arbitrary institutions, with no solvency or exchange-rate risk from intermediary asset issuers or market makers. We present SCP's model, protocol, and formal verification; describe the Stellar payment network; and finally evaluate Stellar empirically through benchmarks and our experience with several years of production use.
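The agreement model behind SCP is built on quorum slices: each participant declares sets of other participants it requires agreement with, and a group of nodes counts as a quorum when every member has at least one slice entirely inside the group. Below is a minimal sketch of that membership check; the four-institution slice configuration is invented for illustration and is not part of the Stellar network.

```python
# Minimal sketch of SCP-style quorum checking with quorum slices.
# The slice configuration below is invented for illustration.

# Each node lists its quorum slices: sets of nodes it requires
# agreement with (a node conventionally appears in its own slices).
SLICES = {
    "bank_a": [{"bank_a", "bank_b", "bank_c"}],
    "bank_b": [{"bank_b", "bank_a", "bank_c"}],
    "bank_c": [{"bank_c", "bank_a", "bank_b"}, {"bank_c", "bank_d"}],
    "bank_d": [{"bank_d", "bank_c"}],
}

def is_quorum(nodes, slices):
    """A non-empty set of nodes is a quorum if every member has at
    least one of its slices fully contained in the set."""
    return bool(nodes) and all(
        any(s <= nodes for s in slices[n])
        for n in nodes
    )

print(is_quorum({"bank_a", "bank_b", "bank_c"}, SLICES))  # True
print(is_quorum({"bank_a", "bank_b"}, SLICES))            # False: their slices also require bank_c
print(is_quorum({"bank_c", "bank_d"}, SLICES))            # True: both have a satisfied slice
```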
Citations: 89
The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359650
Yongle Zhang, Kirk Rodrigues, Yu Luo, M. Stumm, Ding Yuan
The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
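The inflection-point idea reduces to a longest-common-prefix computation over instruction sequences. The sketch below illustrates it on invented traces; Kairux itself assembles candidate non-failure executions from the system's unit tests rather than taking them as given.

```python
# Sketch of the inflection-point idea: among candidate non-failure
# executions, pick the one sharing the longest instruction-sequence
# prefix with the failure execution; the first instruction after that
# prefix is reported as the root cause. Traces below are invented.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def locate_root_cause(failure_trace, non_failure_traces):
    best = max(non_failure_traces,
               key=lambda t: common_prefix_len(failure_trace, t))
    k = common_prefix_len(failure_trace, best)
    # Assumes the failure run actually deviates from the best candidate.
    return failure_trace[k]

failure = ["open(log)", "read(meta)", "lock(region)",
           "double_assign(region)", "abort()"]
candidates = [
    ["open(log)", "read(meta)", "lock(region)", "assign(region)", "commit()"],
    ["open(log)", "read(meta)", "unlock(region)"],
]
print(locate_root_cause(failure, candidates))  # -> "double_assign(region)"
```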
Citations: 25
Using concurrent relational logic with helpers for verifying the AtomFS file system
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359644
Mo Zou, Haoran Ding, Dong Du, Ming Fu, Ronghui Gu, Haibo Chen
Concurrent file systems are pervasive but hard to correctly implement and formally verify due to nondeterministic interleavings. This paper presents AtomFS, the first formally-verified, fine-grained, concurrent file system, which provides linearizable interfaces to applications. The standard way to prove linearizability requires modeling the linearization point of each operation---the moment when its effect becomes visible atomically to other threads. We observe that path inter-dependency, where one operation (like rename) breaks the path integrity of other operations, makes the linearization point external and thus poses a significant challenge to proving linearizability. To overcome this challenge, this paper presents Concurrent Relational Logic with Helpers (CRL-H), a framework for building verified concurrent file systems. CRL-H is made powerful through two key contributions: (1) extending prior approaches that use fixed linearization points with a helper mechanism, where one operation of a thread can logically help other threads linearize their operations; (2) combining relational specifications and rely/guarantee conditions for relational and compositional reasoning. We have successfully applied CRL-H to verify the linearizability of AtomFS directly in C code. All the proofs are mechanized in Coq. Evaluations show that AtomFS speeds up file system workloads by utilizing fine-grained, multicore concurrency.
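The path inter-dependency problem can be seen in a toy interleaving: a rename that runs in the middle of a component-by-component lookup forces the lookup to take effect, logically, before the rename, at a point outside the lookup's own code. The directory tree and interleaving below are invented; this illustrates the problem, not AtomFS itself.

```python
# Toy illustration of path inter-dependency. A rename overlapping a
# component-by-component lookup means the lookup's answer is only
# correct if the whole lookup is considered to have taken effect
# before the rename -- a linearization point external to the lookup.

B = {"c.txt": "data"}
A = {"b": B}
D = {}
root = {"a": A, "d": D}

# Thread 1: lookup("/a/b/c.txt"), resolved one component at a time.
step1 = root["a"]        # -> A
step2 = step1["b"]       # -> B (a held reference, not a path re-check)

# Thread 2 interleaves: rename("/a/b", "/d/b").
D["b"] = A.pop("b")

# Thread 1 resumes and succeeds, although "/a/b/c.txt" no longer
# exists at the moment it returns.
print(step2["c.txt"])    # -> "data"
print("b" in root["a"])  # -> False: the path the lookup reported is gone
```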
Citations: 25
Risk based planning of network changes in evolving data centers
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359664
Omid Alipourfard, Jiaqi Gao, Jérémie Koenig, Chris Harshaw, Amin Vahdat, Minlan Yu
Data center networks evolve as they serve customer traffic. When applying network changes, operators risk impacting customer traffic because the network operates at reduced capacity and is more vulnerable to failures and traffic variations. The impact on customer traffic ultimately translates to operator cost (e.g., refunds to customers). However, planning a network change while minimizing the risks is challenging as we need to adapt to a variety of traffic dynamics and cost functions while scaling to large networks and large changes. Today, operators often use plans that maximize the residual capacity (MRC), which often incurs a high cost under different traffic dynamics. Instead, we propose Janus, which searches the large planning space by leveraging the high degree of symmetry in data center networks. Our evaluation on large Clos networks and Facebook traffic traces shows that Janus generates plans in real-time only needing 33~71% of the cost of MRC planners while adapting to a variety of settings.
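The gap between capacity-only planning and cost-aware planning can be shown with a toy model: two plans that take down the same switches (and so look identical to a planner that only maximizes residual capacity) can differ sharply in expected customer impact depending on when they run. The capacities, traffic forecast, and cost model below are invented; this is not Janus's planner.

```python
# Toy sketch of traffic-aware change planning. Both plans remove the
# same capacity, so a "maximize residual capacity" planner cannot tell
# them apart; comparing expected customer impact under a traffic
# forecast can. All numbers are invented.

TOTAL_CAPACITY = 100.0
RESIDUAL_DURING_CHANGE = 50.0   # capacity left while the change is applied

# Forecast load per hour of the window under consideration.
traffic_forecast = {8: 40.0, 9: 70.0, 10: 85.0, 11: 90.0, 12: 60.0, 13: 35.0}

def plan_cost(change_hours):
    """Expected cost = forecast traffic exceeding residual capacity,
    summed over the hours the change is in progress (a stand-in for
    refunds to impacted customers)."""
    return sum(
        max(0.0, traffic_forecast[h] - RESIDUAL_DURING_CHANGE)
        for h in change_hours
    )

print("change at 10:00-12:00:", plan_cost([10, 11]))  # 35 + 40 = 75
print("change at 12:00-14:00:", plan_cost([12, 13]))  # 10 + 0  = 10
```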
Citations: 16
An analysis of performance evolution of Linux's core operations
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359640
Xiang Ren, Kirk Rodrigues, Luyuan Chen, Juan Camilo Vega, M. Stumm, Ding Yuan
This paper presents an analysis of how Linux's performance has evolved over the past seven years. Unlike recent works that focus on OS performance in terms of scalability or service of a particular workload, this study goes back to basics: the latency of core kernel operations (e.g., system calls, context switching, etc.). To our surprise, the study shows that the performance of many core operations has worsened or fluctuated significantly over the years. For example, the select system call is 100% slower than it was just two years ago. An in-depth analysis shows that over the past seven years, core kernel subsystems have been forced to accommodate an increasing number of security enhancements and new features. These additions steadily add overhead to core kernel operations but also frequently introduce extreme slowdowns of more than 100%. In addition, simple misconfigurations have also severely impacted kernel performance. Overall, we find most of the slowdowns can be attributed to 11 changes. Some forms of slowdown are avoidable with more proactive engineering. We show that it is possible to patch two security enhancements (from the 11 changes) to eliminate most of their overheads. In fact, several features have been introduced to the kernel unoptimized or insufficiently tested and then improved or disabled long after their release. Our findings also highlight both the feasibility and importance for Linux users to actively configure their systems to achieve an optimal balance between performance, functionality, and security: we discover that 8 out of the 11 changes can be avoided by reconfiguring the kernel, and the other 3 can be disabled through simple patches. By disabling the 11 changes with the goal of optimizing performance, we speed up Redis, Apache, and Nginx benchmark workloads by as much as 56%, 33%, and 34%, respectively.
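As a rough illustration of measuring a core operation's latency from user space, the snippet below times the select system call on a single pipe descriptor with a zero timeout. It is a toy harness, not the paper's benchmark suite, and the numbers it prints include Python's own call overhead.

```python
import os
import select
import time

# Illustrative microbenchmark: average latency of the select system
# call on one pipe descriptor with a zero timeout (poll and return).
ITERATIONS = 100_000
r, w = os.pipe()                      # one fd to wait on, never ready

start = time.perf_counter_ns()
for _ in range(ITERATIONS):
    select.select([r], [], [], 0)     # zero timeout: returns immediately
elapsed = time.perf_counter_ns() - start

print(f"select(): {elapsed / ITERATIONS:.0f} ns per call "
      f"(includes Python call overhead)")
os.close(r)
os.close(w)
```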
Citations: 24
Nexus: a GPU cluster engine for accelerating DNN-based video analysis
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359658
Haichen Shen, Lequn Chen, Yuchen Jin, Liangyu Zhao, Bingyu Kong, Matthai Philipose, A. Krishnamurthy, Ravi Sundaram
We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, when required to stay within latency constraints at least 99% of the time, Nexus can process requests at rates 1.8-12.7X higher than state of the art systems can. A long-running multi-application deployment stays within 84% of optimal utilization and, on a 100-GPU cluster, violates latency SLOs on 0.27% of requests.
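One of the scheduling decisions such a serving system faces is picking a batch size: bigger batches keep the GPU busier, but filling and running them takes longer, so the batch must still fit the latency SLO. The sketch below uses an invented latency model to show the trade-off; it is not Nexus's scheduler.

```python
# Toy sketch of SLO-aware batching: pick the largest batch whose
# worst-case latency (time to collect the batch plus execution time)
# still fits the SLO. The latency model below is invented.

SLO_MS = 50.0

def batch_exec_ms(batch_size):
    """Invented cost model: fixed kernel-launch overhead plus a
    per-request cost; the overhead amortizes as the batch grows."""
    return 8.0 + 1.5 * batch_size

def best_batch_size(request_rate_per_ms, slo_ms, max_batch=64):
    """Largest batch size that still meets the SLO at the offered rate."""
    best = 1
    for b in range(1, max_batch + 1):
        queueing_ms = b / request_rate_per_ms     # time to fill the batch
        if queueing_ms + batch_exec_ms(b) <= slo_ms:
            best = b
    return best

for rate in (0.2, 1.0, 5.0):   # requests per millisecond
    b = best_batch_size(rate, SLO_MS)
    print(f"rate={rate:>4} req/ms -> batch={b:>2}, "
          f"throughput~{b / batch_exec_ms(b):.2f} req/ms")
```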
Citations: 151
File systems unfit as distributed storage backends: lessons from 10 years of Ceph evolution
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359656
Abutalib Aghayev, S. Weil, Michael Kuchnik, M. Nelson, G. Ganger, George Amvrosiadis
For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow. Ceph addressed these issues with BlueStore, a new back-end designed to run directly on raw storage devices. In only two years since its inception, BlueStore outperformed previous established backends and is adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, inline compression, decreased performance variability, and avoided a series of performance pitfalls of local file systems. Finally, it makes the adoption of backwards-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.
Citations: 66
Niijima
Pub Date: 2019-10-27 DOI: 10.1145/3341301.3359649
Guoqi Xu, Margus Veanes, M. Barnett, Madan Musuvathi, Todd Mytkowicz, Benjamin G. Zorn, Huan He, Haibo Lin
Multilingual data-parallel pipelines, such as Microsoft's Scope and Apache Spark, are widely used in real-world analytical tasks. While the involvement of multiple languages (often including both managed and native languages) provides much convenience in data manipulation and transformation, it comes at a performance cost --- managed languages need a managed runtime, incurring much overhead. In addition, each switch from a managed to a native runtime (and vice versa) requires marshalling or unmarshalling of an ocean of data objects, taking a large fraction of the execution time. This paper presents Niijima, an optimizing compiler for Microsoft's Scope/Cosmos, which can consolidate C#-based user-defined operators (UDOs) across SQL statements, thereby reducing the number of dataflow vertices that require the managed runtime, and thus the amount of C# computations and the data marshalling cost. We demonstrate that Niijima has reduced job latency by an average of 24% and up to 3.3x, on a series of production jobs.
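The consolidation idea can be sketched with a toy pipeline: fusing adjacent managed operators into a single managed vertex means rows are marshalled into and out of the managed runtime once per fused vertex rather than once per operator. Stage names and the cost model below are invented; this is not Niijima's optimizer.

```python
# Toy sketch of managed-operator consolidation. Each managed (C#)
# vertex marshals its input in and its output out of the managed
# runtime; fusing adjacent managed stages removes whole round trips.
# Stage names and the cost model are invented for illustration.

pipeline = [
    ("extract",       "native"),
    ("parse_udo",     "managed"),
    ("normalize_udo", "managed"),
    ("filter",        "native"),
    ("score_udo",     "managed"),
]

def consolidate(stages):
    """Fuse runs of adjacent managed stages into one managed vertex."""
    merged = []
    for name, runtime in stages:
        if merged and runtime == "managed" and merged[-1][1] == "managed":
            merged[-1] = (merged[-1][0] + "+" + name, "managed")
        else:
            merged.append((name, runtime))
    return merged

def marshalling_passes(stages):
    """Each managed vertex marshals its input in and its output out."""
    return 2 * sum(1 for _, runtime in stages if runtime == "managed")

print(marshalling_passes(pipeline))               # 6 passes over the data
print(marshalling_passes(consolidate(pipeline)))  # 4 passes after fusion
```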
{"title":"Niijima","authors":"Guoqi Xu, Margus Veanes, M. Barnett, Madan Musuvathi, Todd Mytkowicz, Benjamin G. Zorn, Huan He, Haibo Lin","doi":"10.1145/3341301.3359649","DOIUrl":"https://doi.org/10.1145/3341301.3359649","url":null,"abstract":"Multilingual data-parallel pipelines, such as Microsoft's Scope and Apache Spark, are widely used in real-world analytical tasks. While the involvement of multiple languages (often including both managed and native languages) provides much convenience in data manipulation and transformation, it comes at a performance cost --- managed languages need a managed runtime, incurring much overhead. In addition, each switch from a managed to a native runtime (and vice versa) requires marshalling or unmarshalling of an ocean of data objects, taking a large fraction of the execution time. This paper presents Niijima, an optimizing compiler for Microsoft's Scope/Cosmos, which can consolidate C#-based user-defined operators (UDOs) across SQL statements, thereby reducing the number of dataflow vertices that require the managed runtime, and thus the amount of C# computations and the data marshalling cost. We demonstrate that Niijima has reduced job latency by an average of 24% and up to 3.3x, on a series of production jobs.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1