
Proceedings of the 27th ACM Symposium on Operating Systems Principles: Latest Publications

Verifying software network functions with no verification expertise
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359647
Arseniy Zaostrovnykh, Solal Pirelli, Rishabh R. Iyer, Matteo Rizzo, Luis Pedrosa, K. Argyraki, George Candea
We present the design and implementation of Vigor, a software stack and toolchain for building and running software network middleboxes that are guaranteed to be correct, while preserving competitive performance and developer productivity. Developers write the core of the middlebox---the network function (NF)---in C, on top of a standard packet-processing framework, putting persistent state in data structures from Vigor's library; the Vigor toolchain then automatically verifies that the resulting software stack correctly implements a specification, which is written in Python. Vigor has three key features: network function developers need no verification expertise, and the verification process does not require their assistance (push-button verification); the entire software stack is verified, down to the hardware (full-stack verification); and verification can be done in a pay-as-you-go manner, i.e., instead of investing upfront a lot of time in writing and verifying a complete specification, one can specify one-off properties in a few lines of Python and verify them without concern for the rest. We developed five representative NFs---a NAT, a Maglev load balancer, a MAC-learning bridge, a firewall, and a traffic policer---and verified with Vigor that they satisfy standards-derived specifications, are memory-safe, and do not crash or hang. We show that they provide competitive performance. The Vigor framework is available at http://vigor.epfl.ch.
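To make the pay-as-you-go idea concrete, here is a minimal sketch of what a one-off property specified "in a few lines of Python" might look like; the Packet record and check_nat_property helper are illustrative stand-ins, not Vigor's actual specification API.

```python
from collections import namedtuple

Packet = namedtuple("Packet", ["flow", "payload"])

def check_nat_property(packet_out, flow_table):
    # One-off property: a NAT must never forward a packet whose flow
    # was never recorded in its flow table.
    if packet_out is not None and packet_out.flow not in flow_table:
        raise AssertionError("NAT forwarded a packet for an untracked flow")

# The property holds for a tracked flow; an untracked flow would raise.
check_nat_property(Packet(flow=("10.0.0.1", 80), payload=b""), {("10.0.0.1", 80)})
```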
Citations: 30
Yodel
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359648
D. Lazar, Y. Gilad, N. Zeldovich
Yodel is the first system for voice calls that hides metadata (e.g., who is communicating with whom) from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990 ms of latency for 5 million simulated users.
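The self-healing-circuit design rests on the observation that, once a circuit's hop keys are established, each hop needs only fast symmetric cryptography. Below is a toy sketch of that per-hop layering, with a hash-derived keystream standing in for a real cipher such as AES-CTR; all names are illustrative, not Yodel's implementation.

```python
import hashlib

def keystream(key, n):
    # Toy keystream (a stand-in for a real symmetric cipher).
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def onion_encrypt(msg, hop_keys):
    # The sender adds one cheap symmetric layer per hop; no public-key
    # operation is needed once the circuit exists.
    for key in reversed(hop_keys):
        msg = xor(msg, keystream(key, len(msg)))
    return msg

def hop_decrypt(msg, key):
    # Each mix server strips exactly one layer as the message transits.
    return xor(msg, keystream(key, len(msg)))

hop_keys = [b"k1", b"k2", b"k3"]
m = onion_encrypt(b"hello", hop_keys)
for k in hop_keys:
    m = hop_decrypt(m, k)
assert m == b"hello"
```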
{"title":"Yodel","authors":"D. Lazar, Y. Gilad, N. Zeldovich","doi":"10.1145/3341301.3359648","DOIUrl":"https://doi.org/10.1145/3341301.3359648","url":null,"abstract":"Yodel is the first system for voice calls that hides metadata (e.g., who is communicating with whom) from a powerful adversary that controls the network and compromises servers. Voice calls require sub-second message latency, but low latency has been difficult to achieve in prior work where processing each message requires an expensive public key operation at each hop in the network. Yodel avoids this expense with the idea of self-healing circuits, reusable paths through a mix network that use only fast symmetric cryptography. Once created, these circuits are resilient to passive and active attacks from global adversaries. Creating and connecting to these circuits without leaking metadata is another challenge that Yodel addresses with the idea of guarded circuit exchange, where each user creates a backup circuit in case an attacker tampers with their traffic. We evaluate Yodel across the internet and it achieves acceptable voice quality with 990 ms of latency for 5 million simulated users.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
ShortCut
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359659
Xianzheng Dou, Peter M. Chen, J. Flinn
Applications commonly perform repeated computations that are mostly, but not exactly, similar. If a subsequent computation were identical to the original, the operating system could improve performance via memoization, i.e., capturing the differences in program state caused by the computation and applying the differences in lieu of re-executing the computation. However, opportunities for generic memoization are limited by a myriad of differences that arise during execution, e.g., timestamps differ and communication yields non-deterministic responses. Such differences cause memoization to produce incorrect state. ShortCut generically accelerates mostly-deterministic computation by partial memoization. It creates a program, called a slice, that modifies the state diff to account for variation in a subsequent computation. ShortCut learns which inputs, data flows and control flows are likely, and makes assumptions about possible values for each during slice generation. Assuming only likely values rather than allowing all possible values makes complex slice generation feasible and slice execution much faster. Slices are self-verifying; they include predicates that verify all assumptions made during a subsequent execution. When these verifications succeed, the slice is guaranteed to produce a correct modification. If a verification fails, ShortCut transparently rolls back the slice execution and runs the non-memoized computation. Users see no difference between normal, memoized, and rolled-back execution.
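A minimal sketch of the self-verifying slice idea, assuming a slice is just a set of recorded predicates plus a state diff; the Slice class and its fields are invented for illustration, not ShortCut's implementation (which operates on real program state).

```python
class Slice:
    def __init__(self, predicates, state_diff):
        self.predicates = predicates   # assumptions recorded at capture time
        self.state_diff = state_diff   # changes the original computation made

    def replay(self, state, slow_path):
        # Fast path: apply the memoized diff if every assumption still holds.
        if all(pred(state) for pred in self.predicates):
            state.update(self.state_diff)
            return state
        # A verification failed: roll back transparently and re-execute.
        return slow_path(state)

def slow_path(state):
    state = dict(state)
    state["result"] = state["x"] * 2
    return state

slice_ = Slice(predicates=[lambda s: s["x"] == 21], state_diff={"result": 42})
print(slice_.replay({"x": 21}, slow_path))   # fast path applies the diff
print(slice_.replay({"x": 5}, slow_path))    # assumption fails -> re-execute
```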
{"title":"ShortCut","authors":"Xianzheng Dou, Peter M. Chen, J. Flinn","doi":"10.1145/3341301.3359659","DOIUrl":"https://doi.org/10.1145/3341301.3359659","url":null,"abstract":"Applications commonly perform repeated computations that are mostly, but not exactly, similar. If a subsequent computation were identical to the original, the operating system could improve performance via memoization, i.e., capturing the differences in program state caused by the computation and applying the differences in lieu of re-executing the computation. However, opportunities for generic memoization are limited by a myriad of differences that arise during execution, e.g., timestamps differ and communication yields non-deterministic responses. Such difference cause memoization to produce incorrect state. ShortCut generically accelerates mostly-deterministic computation by partial memoization. It creates a program, called a slice, that modifies the state diff to account for variation in a subsequent computation. ShortCut learns which inputs, data flows and control flows are likely, and makes assumptions about possible values for each during slice generation. Assuming only likely values rather than allowing all possible values makes complex slice generation feasible and slice execution much faster. Slices are self-verifying; they include predicates that verify all assumptions made during a subsequent execution. When these verifications succeed, the slice is guaranteed to produce a correct modification. If a verification fails, ShortCut transparently rolls back the slice execution and runs the non-memoized computation. Users see no difference between normal, memoized, and rolled-back execution.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121818506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Performance and protection in the ZoFS user-space NVM file system
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359637
Mingkai Dong, Heng Bu, Jifei Yi, Benchao Dong, Haibo Chen
Non-volatile memory (NVM) can be directly accessed in user space without going through the kernel. This has encouraged several recent studies on building user-space NVM file systems. However, for the sake of file system protection, none of the existing file systems grant user-space file system libraries direct control over both metadata and data of the NVM, leaving fast NVM resources underexploited. Based on the observation that applications tend to group files with similar access permissions within the same directory and that permission changes are rare operations, this paper proposes a new abstraction called coffer, which is a collection of isolated NVM resources, and shows its merits in building a performant and protected NVM file system in user space. The key idea is to separate NVM protection from management via coffers so that user-space libraries can take full control of NVM within a coffer while the kernel guarantees strict isolation among coffers. Based on coffers, we build an NVM file system architecture to bring the high performance of NVM to unmodified dynamically linked applications and facilitate the development of performant and flexible user-space NVM file system libraries. With an example file system called ZoFS, we show that user-space file systems built upon coffers can outperform existing NVM file systems in both benchmarks and real-world applications.
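A toy model of the coffer split, assuming a kernel that only enforces per-coffer permissions and a user-space library that then manages both metadata and data inside a mapped coffer; all class and method names here are illustrative, not ZoFS's API.

```python
class Kernel:
    def __init__(self):
        self.coffers = {}   # path prefix -> (owner, pages)

    def create_coffer(self, prefix, owner, pages):
        self.coffers[prefix] = (owner, pages)

    def map_coffer(self, prefix, user):
        # The kernel's only job: strict isolation among coffers.
        owner, pages = self.coffers[prefix]
        if user != owner:
            raise PermissionError("coffer isolation enforced by the kernel")
        return pages        # user space now controls metadata *and* data

class UserFSLib:
    def __init__(self, pages):
        self.pages = pages
        self.inodes = {}    # metadata lives in user space, no kernel crossing

    def write(self, name, data):
        self.inodes[name] = data   # direct NVM-style update, no syscall

kernel = Kernel()
kernel.create_coffer("/home/alice", owner="alice", pages=bytearray(4096))
fs = UserFSLib(kernel.map_coffer("/home/alice", user="alice"))
fs.write("notes.txt", b"fast path, no kernel")
```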
Citations: 81
Snap: a microkernel approach to host networking
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359657
Michael R. Marty, M. Kruijf, Jacob Adriaens, C. Alfeld, S. Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C. Evans, S. Gribble, Nicholas Kidd, R. Kononov, G. Kumar, Carl J. Mauer, Emily Musick, Lena E. Olson, Erik Rubow, Michael Ryan, K. Springborn, Paul Turner, V. Valancius, Xi Wang, Amin Vahdat
This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google's rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems. Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams.
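A minimal sketch of the transparent-upgrade idea the abstract describes, assuming an engine whose state can be snapshotted and handed to the next version; the class shapes are invented for illustration (Snap's real handoff serializes engine state across userspace processes).

```python
class PacketEngineV1:
    version = 1
    def __init__(self, state=None):
        self.flows = state or {}
    def process(self, pkt):
        self.flows[pkt] = self.flows.get(pkt, 0) + 1
    def snapshot(self):
        return dict(self.flows)    # state handed off across the upgrade

class PacketEngineV2(PacketEngineV1):
    version = 2                    # new feature rollout, same serialized state

engine = PacketEngineV1()
engine.process("flow-a")
# Upgrade in place: the new version resumes from the old version's state,
# so applications never migrate off the machine.
engine = PacketEngineV2(state=engine.snapshot())
engine.process("flow-a")
assert engine.flows["flow-a"] == 2 and engine.version == 2
```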
Citations: 136
Scalable and practical locking with shuffling
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359629
Sanidhya Kashyap, I. Calciu, Xiao-he Cheng, Changwoo Min, Taesoo Kim
Locks are an essential building block for high-performance multicore system software. To meet performance goals, lock algorithms have evolved towards specialized solutions for architectural characteristics (e.g., NUMA). However, in practice, applications run on different server platforms and exhibit widely diverse behaviors that evolve with time (e.g., number of threads, number of locks). This creates performance and scalability problems for locks optimized for a single scenario and platform. For example, popular spinlocks suffer from excessive cache-line bouncing in NUMA systems, while scalable, NUMA-aware locks exhibit sub-par single-thread performance. In this paper, we identify four dominating factors that impact the performance of lock algorithms. We then propose a new technique, shuffling, that can dynamically accommodate all these factors, without slowing down the critical path of the lock. The key idea of shuffling is to re-order the queue of threads waiting to acquire the lock in accordance with some pre-established policy. For best performance, this work is done off the critical path, by the waiter threads. Using shuffling, we demonstrate how to achieve NUMA-awareness and implement an efficient parking/wake-up strategy, without any auxiliary data structure, mostly off the critical path. The evaluation shows that our family of locks based on shuffling improves the throughput of real-world applications up to 12.5x, with impressive memory footprint reduction compared with the recent lock algorithms.
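A toy illustration of shuffling, assuming each waiter carries its NUMA node and a waiting thread (not the lock holder) reorders the queue according to a locality policy; the Waiter structure and policy are illustrative, and a real implementation would reorder a linked list of spinning waiters in place.

```python
from collections import deque

class Waiter:
    def __init__(self, tid, numa_node):
        self.tid, self.numa_node = tid, numa_node

def shuffle(queue, policy_node):
    # Performed by a waiting thread, off the lock's critical path: move
    # waiters from the preferred NUMA node ahead of remote ones, preserving
    # order within each group, so the lock hops sockets less often.
    local = [w for w in queue if w.numa_node == policy_node]
    remote = [w for w in queue if w.numa_node != policy_node]
    return deque(local + remote)

queue = deque([Waiter(1, 0), Waiter(2, 1), Waiter(3, 0), Waiter(4, 1)])
queue = shuffle(queue, policy_node=0)
print([w.tid for w in queue])   # [1, 3, 2, 4]: same-node waiters grouped
```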
Citations: 28
Gerenuk
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359643
Christian Navasca, Cheng Cai, Khanh Nguyen, Brian Demsky, Shan Lu, Miryung Kim, Guoqing Harry Xu
Big Data systems are typically implemented in object-oriented languages such as Java and Scala due to the quick development cycle they provide. These systems are executed on top of a managed runtime such as the Java Virtual Machine (JVM), which requires each data item to be represented as an object before it can be processed. This representation is the direct cause of many kinds of severe inefficiencies. We developed Gerenuk, a compiler and runtime that aims to enable a JVM-based data-parallel system to achieve near-native efficiency by transforming a set of statements in the system for direct execution over inlined native bytes. The key insight leading to Gerenuk's success is two-fold: (1) analytics workloads often use immutable and confined data types. If we speculatively optimize the system and user code with this assumption, the transformation can be made tractable. (2) The flow of data starts at a deserialization point where objects are created from a sequence of native bytes and ends at a serialization point where they are turned back into a byte sequence to be sent to the disk or network. This flow naturally defines a speculative execution region (SER) to be transformed. Gerenuk compiles a SER speculatively into a version that can operate directly over native bytes that come from the disk or network. The Gerenuk runtime aborts the SER execution upon violations of the immutability and confinement assumption and switches to the slow path by deserializing the bytes and re-executing the original SER. Our evaluation on Spark and Hadoop demonstrates promising results.
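A sketch of a speculative execution region (SER) over native bytes, assuming a fixed-width record layout as the baked-in assumption and a deserializing slow path as the fallback; the layout and function names are invented for illustration (Gerenuk operates on JVM bytecode, not Python).

```python
import struct

def sum_ages_fast(buf):
    # Assumption compiled into the SER: every record is a fixed-width
    # (id: u32, age: u32) pair. Verify it before trusting the fast path.
    if len(buf) % 8 != 0:
        raise ValueError("assumption violated: unexpected record layout")
    # Operate directly over the inlined native bytes, no object creation.
    return sum(struct.unpack_from("<I", buf, off + 4)[0]
               for off in range(0, len(buf), 8))

def sum_ages_slow(buf):
    # Fallback: materialize records (here, tuples) and recompute.
    records = [struct.unpack_from("<II", buf, off)
               for off in range(0, len(buf), 8)]
    return sum(age for _id, age in records)

buf = struct.pack("<IIII", 1, 30, 2, 25)
try:
    total = sum_ages_fast(buf)    # speculative path over native bytes
except ValueError:
    total = sum_ages_slow(buf)    # transparent abort and re-execution
assert total == 55
```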
{"title":"Gerenuk","authors":"Christian Navasca, Cheng Cai, Khanh Nguyen, Brian Demsky, Shan Lu, Miryung Kim, Guoqing Harry Xu","doi":"10.1145/3341301.3359643","DOIUrl":"https://doi.org/10.1145/3341301.3359643","url":null,"abstract":"Big Data systems are typically implemented in object-oriented languages such as Java and Scala due to the quick development cycle they provide. These systems are executed on top of a managed runtime such as the Java Virtual Machine (JVM), which requires each data item to be represented as an object before it can be processed. This representation is the direct cause of many kinds of severe inefficiencies. We developed Gerenuk, a compiler and runtime that aims to enable a JVM-based data-parallel system to achieve near-native efficiency by transforming a set of statements in the system for direct execution over inlined native bytes. The key insight leading to Gerenuk's success is two-fold: (1) analytics workloads often use immutable and confined data types. If we speculatively optimize the system and user code with this assumption, the transformation can be made tractable. (2) The flow of data starts at a deserialization point where objects are created from a sequence of native bytes and ends at a serialization point where they are turned back into a byte sequence to be sent to the disk or network. This flow naturally defines a speculative execution region (SER) to be transformed. Gerenuk compiles a SER speculatively into a version that can operate directly over native bytes that come from the disk or network. The Gerenuk runtime aborts the SER execution upon violations of the immutability and confinement assumption and switches to the slow path by deserializing the bytes and re-executing the original SER. Our evaluation on Spark and Hadoop demonstrates promising results.","PeriodicalId":331561,"journal":{"name":"Proceedings of the 27th ACM Symposium on Operating Systems Principles","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2019-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114406464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Lineage stash: fault tolerance off the critical path
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359653
Stephanie Wang, J. Liagouris, Robert Nishihara, Philipp Moritz, Ujval Misra, Alexey Tumanov, I. Stoica
As cluster computing frameworks such as Spark, Dryad, Flink, and Ray are being deployed in mission critical applications and on larger and larger clusters, their ability to tolerate failures is growing in importance. These frameworks employ two broad approaches for fault tolerance: checkpointing and lineage. Checkpointing exhibits low overhead during normal operation but high overhead during recovery, while lineage-based solutions make the opposite tradeoff. We propose the lineage stash, a decentralized causal logging technique that significantly reduces the runtime overhead of lineage-based approaches without impacting recovery efficiency. With the lineage stash, instead of recording the task's information before the task is executed, we record it asynchronously and forward the lineage along with the task. This makes it possible to support large-scale, low-latency (millisecond-level) data processing applications with low runtime and recovery overheads. Experimental results for applications in distributed training and stream processing show that the lineage stash provides task execution latencies similar to checkpointing alone, while incurring a recovery overhead as low as traditional lineage-based approaches.
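A minimal sketch of the stash idea, assuming lineage entries are piggybacked on task submission and persisted by a background thread so nothing synchronous sits on the critical path; the submit shape and queue-based log are illustrative, not the paper's implementation.

```python
import queue
import threading

stash_log = queue.Queue()      # asynchronous, off-critical-path log
durable_log = []

def async_flusher():
    # Persists lineage in the background; task execution never waits on it.
    while True:
        entry = stash_log.get()
        if entry is None:
            break
        durable_log.append(entry)

threading.Thread(target=async_flusher, daemon=True).start()

def submit(task_id, fn, args, lineage):
    # Forward the lineage along with the task, record it asynchronously,
    # then execute immediately instead of logging before execution.
    lineage = lineage + [(task_id, fn.__name__, args)]
    stash_log.put(lineage[-1])
    return fn(*args), lineage

result, lin = submit("t1", lambda x: x + 1, (41,), lineage=[])
# On failure, `lin` suffices to re-execute t1: recovery replays the lineage.
```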
Citations: 37
Parity models: erasure-coded resilience for prediction serving systems
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359654
J. Kosaian, K. V. Rashmi, S. Venkataraman
Machine learning models are becoming the primary workhorses for many applications. Services deploy models through prediction serving systems that take in queries and return predictions by performing inference on models. Prediction serving systems are commonly run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency. Erasure coding is a popular technique for achieving resource-efficient resilience to data unavailability in storage and communication systems. However, existing approaches for imparting erasure-coded resilience to distributed computation apply only to a severely limited class of functions, precluding their use for many serving workloads, such as neural network inference. We introduce parity models, a new approach for enabling erasure-coded resilience in prediction serving systems. A parity model is a neural network trained to transform erasure-coded queries into a form that enables a decoder to reconstruct slow or failed predictions. We implement parity models in ParM, a prediction serving system that makes use of erasure-coded resilience. ParM encodes multiple queries into a "parity query," performs inference over parity queries using parity models, and decodes approximations of unavailable predictions by using the output of a parity model. We showcase the applicability of parity models to image classification, speech recognition, and object localization tasks. Using parity models, ParM reduces the gap between 99.9th percentile and median latency by up to 3.5X, while maintaining the same median. These results display the potential of parity models to unlock a new avenue to imparting resource-efficient resilience to prediction serving systems.
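The coding scheme is easiest to see for a linear model, where the parity model coincides with the deployed model and reconstruction is exact; ParM's contribution is learning a parity model for nonlinear networks, where reconstruction is approximate. A sketch with k = 2 queries, all shapes illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))            # the deployed model (linear here)
model = lambda q: W @ q
parity_model = lambda q: W @ q         # for a linear model they coincide

queries = [rng.normal(size=4) for _ in range(2)]
parity_query = queries[0] + queries[1] # encoder: k=2 queries -> 1 parity query

preds = [model(q) for q in queries]
# Suppose preds[1] is lost to a straggler or failure; the decoder recovers it
# from the parity prediction and the surviving prediction.
reconstructed = parity_model(parity_query) - preds[0]
assert np.allclose(reconstructed, preds[1])
```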
Citations: 49
Finding semantic bugs in file systems with an extensible fuzzing framework
Pub Date: 2019-10-27 | DOI: 10.1145/3341301.3359662
Seulbae Kim, Meng Xu, Sanidhya Kashyap, Jungyeon Yoon, Wen Xu, Taesoo Kim
File systems are too large to be bug free. Although handwritten test suites have been widely used to stress file systems, they can hardly keep up with the rapid increase in file system size and complexity, leading to new bugs being introduced and reported regularly. These bugs come in various flavors, from simple buffer overflows to sophisticated semantic bugs. Although bug-specific checkers exist, they generally lack a way to explore file system states thoroughly. More importantly, no turnkey solution exists that unifies the checking effort of various aspects of a file system under one umbrella. In this paper, we highlight the potential of applying fuzzing to find not just memory errors but, in theory, any type of file system bugs with an extensible fuzzing framework: Hydra. Hydra provides building blocks for file system fuzzing, including input mutators, feedback engines, a libOS-based executor, and a bug reproducer with test case minimization. As a result, developers only need to focus on building the core logic for finding bugs of their own interests. We showcase the effectiveness of Hydra with four checkers that hunt crash inconsistency, POSIX violations, logic assertion failures, and memory errors. So far, Hydra has discovered 91 new bugs in Linux file systems, including one in a verified file system (FSCQ), as well as four POSIX violations.
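A sketch of what an extensible checker interface might look like, assuming checkers expose a single check hook that the fuzzing loop invokes after each mutated execution; the Checker interface, the mutation stub, and the POSIX rename example are all invented for illustration, not Hydra's actual API.

```python
import random

class Checker:
    name = "base"
    def check(self, fs_state):
        # Return a list of violations found in the post-execution state.
        raise NotImplementedError

class PosixChecker(Checker):
    name = "posix"
    def check(self, fs_state):
        # Toy POSIX rule: rename() must be atomic, so after a crash exactly
        # one of the old and new names may exist.
        old, new = fs_state.get("old"), fs_state.get("new")
        return [] if (old is None) != (new is None) else ["rename not atomic"]

def fuzz_once(seed, checkers):
    # Stand-in for image mutation plus syscall execution in a real framework.
    random.seed(seed)
    fs_state = {"old": None, "new": "f" if random.random() < 0.9 else None}
    return [(c.name, v) for c in checkers for v in c.check(fs_state)]

for seed in range(5):
    for name, violation in fuzz_once(seed, [PosixChecker()]):
        print(f"[{name}] seed={seed}: {violation}")
```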
Citations: 66