Operating Systems Review (ACM): Latest Articles

Generating Complex, Realistic Cloud Workloads using Recurrent Neural Networks

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483590

S. Bergsma, Timothy J. Zeyl, Arik Senderovich, J. Christopher Beck

Decision-making in large-scale compute clouds relies on accurate workload modeling. Unfortunately, prior models have proven insufficient in capturing the complex correlations in real cloud workloads. We introduce the first model of large-scale cloud workloads that captures long-range inter-job correlations in arrival rates, resource requirements, and lifetimes. Our approach models workload as a three-stage generative process, with separate models for: (1) the number of batch arrivals over time, (2) the sequence of requested resources, and (3) the sequence of lifetimes. Our lifetime model is a novel extension of recent work in neural survival prediction. It represents and exploits inter-job correlations using a recurrent neural network. We validate our approach by showing it is able to accurately generate the production virtual machine workload of two real-world cloud providers.
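The three-stage structure described in the abstract above can be sketched as follows. This is only an illustration of the staging, not the authors' model: the paper uses recurrent neural networks, whereas this sketch swaps in a simple "sticky" sampler that repeats the previous value with some probability to mimic inter-job correlation. The names `FLAVORS`, `LIFETIMES_H`, `sticky_sequence`, and `generate_workload`, and all numeric parameters, are hypothetical.

```python
import random
from typing import Dict, List, Sequence

FLAVORS = ["2cpu-4gb", "4cpu-16gb", "8cpu-32gb"]  # hypothetical VM sizes
LIFETIMES_H = [1, 24, 720]                         # hypothetical lifetime bins (hours)


def sticky_sequence(rng: random.Random, options: Sequence, length: int,
                    stickiness: float = 0.8) -> List:
    """Stand-in for a learned sequence model: each item repeats the previous
    one with probability `stickiness`, otherwise it is resampled uniformly."""
    out: List = []
    for _ in range(length):
        if out and rng.random() < stickiness:
            out.append(out[-1])
        else:
            out.append(rng.choice(options))
    return out


def generate_workload(periods: int, seed: int = 0) -> List[Dict[str, object]]:
    rng = random.Random(seed)
    # Stage 1: number of arrivals in each time period (uniform stand-in).
    arrivals = [rng.randint(0, 3) for _ in range(periods)]
    total = sum(arrivals)
    # Stage 2: sequence of requested resources, one per arriving job.
    flavors = sticky_sequence(rng, FLAVORS, total)
    # Stage 3: sequence of lifetimes, one per arriving job.
    lifetimes = sticky_sequence(rng, LIFETIMES_H, total)
    # Stitch the three sequences back into per-job records.
    jobs: List[Dict[str, object]] = []
    index = 0
    for period, count in enumerate(arrivals):
        for _ in range(count):
            jobs.append({"period": period,
                         "flavor": flavors[index],
                         "lifetime_h": lifetimes[index]})
            index += 1
    return jobs


jobs = generate_workload(periods=5, seed=1)
```

Keeping the stages separate means each sampler could be replaced by a trained model without touching the other two, which is the property the abstract emphasizes.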

Citations: 12

Gradient Compression Supercharged High-Performance Data Parallel DNN Training

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483553

Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, Yinlong Xu

Gradient compression is a promising approach to alleviating the communication bottleneck in data parallel deep neural network (DNN) training by significantly reducing the data volume of gradients for synchronization. While gradient compression is being actively adopted by the industry (e.g., Facebook and AWS), our study reveals that there are two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems imposes heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance. In this paper, we first propose a compression-aware gradient synchronization architecture, CaSync, which relies on a flexible composition of basic computing and communication primitives. It is general and compatible with any gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining. We further introduce a gradient compression toolkit, CompLL, to enable efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Lastly, we build a compression-aware DNN training framework HiPress with CaSync and CompLL. HiPress is open-sourced and runs on mainstream DNN systems such as MXNet, TensorFlow, and PyTorch. Evaluation via a 16-node cluster with 128 NVIDIA V100 GPUs and 100Gbps network shows that HiPress improves the training speed over current compression-enabled systems (e.g., BytePS-onebit and Ring-DGC) by 17.2%-69.5% across six popular DNN models.
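To make "reducing the data volume of gradients" concrete, here is a minimal sketch of top-k sparsification, one common family of gradient compression. It is not CompLL's API or the HiPress implementation; the function names and the sample gradient are assumptions chosen for illustration, and plain Python lists stand in for on-GPU tensors.

```python
from typing import List, Tuple


def topk_compress(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k entries with the largest magnitude; return (indices, values)."""
    if k <= 0:
        return [], []
    # Rank positions by descending magnitude, then keep the first k in index order.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(order[:k])
    return kept, [grad[i] for i in kept]


def topk_decompress(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild a dense gradient, filling the dropped entries with zeros."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


grad = [0.1, -2.0, 0.03, 1.5, -0.2, 0.0]
idx, vals = topk_compress(grad, k=2)            # only 2 of 6 entries are sent
restored = topk_decompress(idx, vals, len(grad))
```

Only the index/value pairs cross the network, so the synchronization step moves a fraction of the original data; the compress and decompress calls themselves are the extra work that, per the abstract, must be coordinated with communication to avoid eating the savings.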

Citations: 29

Crash Consistent Non-Volatile Memory Express

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483592

Xiaojian Liao, Youyou Lu, Zhe Yang, J. Shu

This paper presents crash consistent Non-Volatile Memory Express (ccNVMe), a novel extension of NVMe that defines how host software communicates with non-volatile memory (e.g., a solid-state drive) across a PCI Express bus with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe interface. ccNVMe alleviates this major bottleneck by coupling crash consistency to data dissemination. This new idea allows the storage system to achieve crash consistency by taking a free ride on the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIO), unlike traditional systems that use complex update protocols and heavyweight block I/Os. ccNVMe introduces transaction-aware MMIO and doorbell to reduce PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system, MQFS, atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.

Citations: 6

Understanding and Detecting Software Upgrade Failures in Distributed Systems

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483577

Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, Ding Yuan

Upgrading is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics. This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework, DUPTester, that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers, DUPChecker, that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.

Citations: 22

Exploiting Nil-Externality for Fast Replicated Storage

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483543

Aishwarya Ganesan, R. Alagappan, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau

Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This paper answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance. In this paper, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that offers high performance by deferring ordering and executing operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: for many workloads, Skyros provides higher performance than standard consensus-based replication. For example, Skyros offers 3x lower latency while providing the same high throughput offered by throughput-optimized Paxos.
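The nilext idea in the abstract above can be illustrated with a toy, single-process store. This is a sketch under strong assumptions, not Skyros: there is no replication, durability, or cross-replica ordering here, and the class name `LazyStore` and its `applied_count` counter are hypothetical. It only shows why an operation that returns nothing about system state can be applied lazily, while a read must force pending work.

```python
from typing import Dict, List, Optional, Tuple


class LazyStore:
    """Toy key-value store that defers nilext writes until a read externalizes them."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, str]] = []  # recorded but not yet applied
        self._state: Dict[str, str] = {}
        self.applied_count = 0                     # how many puts have been executed

    def put(self, key: str, value: str) -> None:
        """Nilext: modifies state but reveals nothing, so just record it."""
        self._pending.append((key, value))

    def get(self, key: str) -> Optional[str]:
        """Externalizing: the result must reflect every earlier put."""
        self._apply_pending()
        return self._state.get(key)

    def _apply_pending(self) -> None:
        # Apply the deferred puts in the order they were recorded.
        for key, value in self._pending:
            self._state[key] = value
            self.applied_count += 1
        self._pending.clear()


store = LazyStore()
store.put("a", "1")
store.put("a", "2")
deferred = store.applied_count   # still 0: nothing has been executed yet
value = store.get("a")           # forces both puts, returns the latest value
```

In a replicated setting the deferred work would include agreeing on an order, which is the expensive part that the abstract says Skyros postpones.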

Citations: 6

Log-structured Protocols in Delos

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483544

M. Balakrishnan, Chen Shen, AH Jafri, Suyog Mapara, D. Geraghty, J. Flinn, Vidhya Venkat, I. Nedelchev, Santosh K. Ghosh, Mihir Dharamshi, Jingming Liu, Filip Gruszczyński, Jun Li, Rounak Tibrewal, Ali Zaveri, Rajeev Nagar, Ahmed Yossef, Francois Richard, YeeJiun Song

Developers have access to a wide range of storage APIs and functionality in large-scale systems, such as relational databases, key-value stores, and namespaces. However, this diversity comes at a cost: each API is implemented by a complex distributed system that is difficult to develop and operate. Delos amortizes this cost by enabling different APIs on a shared codebase and operational platform. The primary innovation in Delos is a log-structured protocol: a fine-grained replicated state machine executing above a shared log that can be layered into reusable protocol stacks under different databases. We built and deployed two production databases using Delos at Facebook, creating nine different log-structured protocols in the process. We show via experiments and production data that log-structured protocols impose low overhead, while allowing optimizations that can improve latency by up to 100X (e.g., via leasing) and throughput by up to 2X (e.g., via batching).
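The layering described in the abstract above (replicated state machines stacked over a shared log) can be sketched roughly as below. This is an illustrative toy, not the Delos API: `SharedLog`, `BaseEngine`, `HeaderEngine`, `KeyValueApp`, `build_replica`, and the engine names are all hypothetical, and an in-memory list stands in for the real shared log. Each layer wraps an entry with its own header on the way down and strips it on the way up.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]


class SharedLog:
    """Stand-in for the shared log: an in-memory, totally ordered list."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []


class BaseEngine:
    """Bottom layer: appends to the log and replays unseen entries upward."""

    def __init__(self, log: SharedLog) -> None:
        self.log, self.upper, self.cursor = log, None, 0

    def propose(self, entry: Entry) -> None:
        self.log.entries.append(entry)

    def sync(self) -> None:
        while self.cursor < len(self.log.entries):
            entry = self.log.entries[self.cursor]
            self.cursor += 1
            self.upper.apply(entry)


class HeaderEngine:
    """Reusable middle layer: adds its header going down, removes it going up."""

    def __init__(self, name: str, lower: Any) -> None:
        self.name, self.lower, self.upper = name, lower, None
        lower.upper = self

    def propose(self, entry: Entry) -> None:
        self.lower.propose({"header": self.name, "body": entry})

    def sync(self) -> None:
        self.lower.sync()

    def apply(self, entry: Entry) -> None:
        assert entry["header"] == self.name
        self.upper.apply(entry["body"])


class KeyValueApp:
    """Top of the stack: a key-value state machine driven by log replay."""

    def __init__(self, lower: Any) -> None:
        self.state: Dict[str, str] = {}
        self.lower = lower
        lower.upper = self

    def put(self, key: str, value: str) -> None:
        self.lower.propose({"op": "put", "key": key, "value": value})
        self.lower.sync()

    def apply(self, entry: Entry) -> None:
        if entry["op"] == "put":
            self.state[entry["key"]] = entry["value"]


def build_replica(log: SharedLog) -> KeyValueApp:
    base = BaseEngine(log)
    lower = HeaderEngine("lower_engine", base)
    upper = HeaderEngine("upper_engine", lower)
    return KeyValueApp(upper)


log = SharedLog()
replica_a, replica_b = build_replica(log), build_replica(log)
replica_a.put("x", "1")
replica_b.put("y", "2")
replica_a.lower.sync()   # replica A catches up on B's entry
```

Because every replica replays the same log through the same stack, both end in the same state, and a different database could reuse `HeaderEngine`-style layers under a different top-level application.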

Citations: 6

The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483569

Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, J. Nelson, O. N. Leija, Ashlie Martinez, Jing Liu, A. K. Simpson, Sujay Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury, Anirudh Badam

Datacenter systems and I/O devices now run at single-digit microsecond latencies, requiring ns-scale operating systems. Traditional kernel-based operating systems impose an unaffordable overhead, so recent kernel-bypass OSes [73] and libraries [23] eliminate the OS kernel from the I/O datapath. However, none of these systems offers a general-purpose datapath OS replacement that meets the needs of μs-scale systems. This paper proposes Demikernel, a flexible datapath OS and architecture designed for heterogeneous kernel-bypass devices and μs-scale datacenter systems. We build two prototype Demikernel OSes and show that minimal effort is needed to port existing μs-scale systems. Once ported, Demikernel lets applications run across heterogeneous kernel-bypass devices with ns-scale overheads and no code changes.

Citations: 49

ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483542

J. Humphries, Neel Natu, Ashwin Chaugule, Ofir Weisse, Barret Rhoden, Josh Don, L. Rizzo, Oleg Rombakh, Paul Turner, C. Kozyrakis

We present ghOSt, our infrastructure for delegating kernel scheduling decisions to userspace code. ghOSt is designed to support the rapidly evolving needs of our data center workloads and platforms. Improving scheduling decisions can drastically improve the throughput, tail latency, scalability, and security of important workloads. However, kernel schedulers are difficult to implement, test, and deploy efficiently across a large fleet. Recent research suggests bespoke scheduling policies, within custom data plane operating systems, can provide compelling performance results in a data center setting. However, these gains have proved difficult to realize, as it is impractical to deploy custom OS images at an application granularity, particularly in a multi-tenant environment, limiting the practical applications of these new techniques. ghOSt provides general-purpose delegation of scheduling policies to userspace processes in a Linux environment. ghOSt provides state encapsulation, communication, and action mechanisms that allow complex expression of scheduling policies within a userspace agent, while assisting in synchronization. Programmers can use any language to develop and optimize policies, which can be modified without a host reboot. ghOSt supports a wide range of scheduling models, from per-CPU to centralized and from run-to-completion to preemptive, and incurs low overheads for scheduling actions. We demonstrate ghOSt's performance on both academic and real-world workloads, including Google Snap and Google Search. We show that by using ghOSt instead of the kernel scheduler, we can quickly achieve comparable throughput and latency while enabling policy optimization, non-disruptive upgrades, and fault isolation for our data center workloads. We open-source our implementation to enable future research and development based on ghOSt.

Citations: 28

Geometric Partitioning: Explore the Boundary of Optimal Erasure Code Repair

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483558

Yingdi Shan, Kang Chen, Tuoyu Gong, Lidong Zhou, Tai Zhou, Yongwei Wu

Erasure coding is widely used in building reliable distributed object storage systems despite its high repair cost. Regenerating codes are a special class of erasure codes, proposed to minimize the amount of data needed for repair. In this paper, we assess how optimal repair can help to improve object storage systems, and we find that regenerating codes present unique challenges: regenerating codes repair at the granularity of chunks instead of bytes, and the choice of chunk size creates a tension between streamed degraded read time and repair throughput. To address this dilemma, we propose Geometric Partitioning, which partitions each object into a series of chunks with their sizes in a geometric sequence to obtain the benefits of both large and small chunk sizes. Geometric Partitioning helps regenerating codes achieve 1.85x the recovery performance of RS codes while keeping degraded read time low.
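As a rough illustration of "chunks with their sizes in a geometric sequence", the sketch below splits an object so that each chunk is a fixed multiple of the previous one. The function name `geometric_chunks`, its parameters, and the choice to truncate the final chunk to whatever remains are assumptions made here for illustration; the paper's actual partitioning rules may differ.

```python
from typing import List


def geometric_chunks(object_size: int, first_chunk: int, ratio: int = 2) -> List[int]:
    """Split `object_size` bytes into chunks of first_chunk, first_chunk*ratio, ...

    The last chunk is cut short if the object ends before the next full size.
    """
    if object_size < 0 or first_chunk <= 0 or ratio < 1:
        raise ValueError("object_size >= 0, first_chunk > 0 and ratio >= 1 are required")
    sizes: List[int] = []
    remaining = object_size
    size = first_chunk
    while remaining > 0:
        take = min(size, remaining)  # never read past the end of the object
        sizes.append(take)
        remaining -= take
        size *= ratio                # next chunk grows geometrically
    return sizes


sizes = geometric_chunks(object_size=100, first_chunk=4, ratio=2)
```

Small leading chunks let a degraded read start streaming quickly, while the large trailing chunks carry most of the bytes and keep repair throughput high, which is the trade-off the abstract describes.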

Citations: 7

Birds of a Feather Flock Together: Scaling RDMA RPCs with Flock

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483576

S. Monga, Sanidhya Kashyap, Changwoo Min

RDMA-capable networks are gaining traction in datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features, such as remote memory operations. However, efficiently utilizing RDMA capability in the common setting of a high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, which does not scale with increasing cluster size. To address that, several works forgo some RDMA features by focusing only on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA, while scaling the number of connections regardless of the cluster size. We present Flock, a communication framework for RDMA networks that uses hardware-provided reliable connections. Using a partially shared model, Flock departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements, contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing while ensuring fair utilization of network connections. We demonstrate the benefits for a distributed transaction processing system and an in-memory index, where it outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.

Citations: 16
