
Operating Systems Review (ACM): Latest Publications

Gradient Compression Supercharged High-Performance Data Parallel DNN Training
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483553
Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, Yinlong Xu
Gradient compression is a promising approach to alleviating the communication bottleneck in data parallel deep neural network (DNN) training by significantly reducing the data volume of gradients for synchronization. While gradient compression is being actively adopted by the industry (e.g., Facebook and AWS), our study reveals that there are two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems imposes heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance. In this paper, we first propose a compression-aware gradient synchronization architecture, CaSync, which relies on a flexible composition of basic computing and communication primitives. It is general and compatible with any gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining. We further introduce a gradient compression toolkit, CompLL, to enable efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Lastly, we build a compression-aware DNN training framework HiPress with CaSync and CompLL. HiPress is open-sourced and runs on mainstream DNN systems such as MXNet, TensorFlow, and PyTorch. Evaluation via a 16-node cluster with 128 NVIDIA V100 GPUs and 100Gbps network shows that HiPress improves the training speed over current compression-enabled systems (e.g., BytePS-onebit and Ring-DGC) by 17.2%-69.5% across six popular DNN models.
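To make the data-volume reduction concrete, below is a plain-NumPy sketch of one widely used compression scheme, top-k gradient sparsification. It is a generic illustration of the kind of algorithm CaSync and CompLL are designed to host, not HiPress's actual GPU kernels, and the function names are ours.

```python
import numpy as np

def topk_compress(grad: np.ndarray, ratio: float = 0.01):
    """Keep only the largest-magnitude entries of a gradient tensor.
    Returns the sparse payload (indices, values) actually synchronized,
    plus the dense shape needed for decompression."""
    flat = grad.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # O(n) top-k selection
    return idx.astype(np.int64), flat[idx], grad.shape

def topk_decompress(idx, values, shape):
    """Rebuild a dense gradient from the sparse payload."""
    flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    flat[idx] = values
    return flat.reshape(shape)

# The sparse payload keeps ~1% of the elements, cutting synchronized
# bytes by more than an order of magnitude even with index overhead.
g = np.random.randn(4096, 1024).astype(np.float32)
idx, vals, shape = topk_compress(g, ratio=0.01)
g_hat = topk_decompress(idx, vals, shape)
print(idx.size, "of", g.size, "elements transmitted")
```

In a real deployment the compress/decompress kernels run on the GPU and are interleaved with partitioned synchronization steps, which is the computation-communication pipelining opportunity CaSync targets.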
Citations: 29
Exploiting Nil-Externality for Fast Replicated Storage
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483543
Aishwarya Ganesan, R. Alagappan, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Do some storage interfaces enable higher performance than others? Can one identify and exploit such interfaces to realize high performance in storage systems? This paper answers these questions in the affirmative by identifying nil-externality, a property of storage interfaces. A nil-externalizing (nilext) interface may modify state within a storage system but does not externalize its effects or system state immediately to the outside world. As a result, a storage system can apply nilext operations lazily, improving performance. In this paper, we take advantage of nilext interfaces to build high-performance replicated storage. We implement Skyros, a nilext-aware replication protocol that offers high performance by deferring ordering and executing operations until their effects are externalized. We show that exploiting nil-externality offers significant benefit: for many workloads, Skyros provides higher performance than standard consensus-based replication. For example, Skyros offers 3x lower latency while providing the same high throughput offered by throughput-optimized Paxos.
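To see why the property matters, consider the toy single-node model below: a put returns nothing to the client, so it is nil-externalizing and can merely be logged; a get externalizes state, so deferred puts must be ordered and applied before it returns. This is only a model of the definition, with names of our choosing, not the Skyros replication protocol.

```python
class NilextAwareStore:
    """Toy key-value store that defers nil-externalizing operations.

    put() modifies state but externalizes nothing, so it is only logged
    (fast path). get() externalizes state, so all pending puts must be
    ordered and applied first (the deferred work)."""
    def __init__(self):
        self.data = {}
        self.pending = []          # logged but not yet applied

    def put(self, key, value):
        # Nilext: acknowledge right after logging; no ordering, no execution.
        self.pending.append((key, value))

    def get(self, key):
        # Non-nilext: effects become visible, so drain the pending log
        # in order before answering.
        for k, v in self.pending:
            self.data[k] = v
        self.pending.clear()
        return self.data.get(key)

store = NilextAwareStore()
store.put("x", 1)      # returns quickly; work deferred
store.put("x", 2)
print(store.get("x"))  # forces ordering + execution, prints 2
```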
Citations: 6
Crash Consistent Non-Volatile Memory Express
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483592
Xiaojian Liao, Youyou Lu, Zhe Yang, J. Shu
This paper presents crash consistent Non-Volatile Memory Express (ccNVMe), a novel extension of the NVMe that defines how host software communicates with the non-volatile memory (e.g., solid-state drive) across a PCI Express bus with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe interface. ccNVMe alleviates this major bottleneck by coupling the crash consistency to the data dissemination. This new idea allows the storage system to achieve crash consistency by taking the free rides of the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIO), unlike traditional systems that use complex update protocols and heavyweight block I/Os. ccNVMe introduces transaction-aware MMIO and doorbell to reduce the PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system, namely MQFS, atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.
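The guarantee at stake can be seen in a small software model of transactional writes: a transaction's data blocks plus one small commit marker are persisted, and recovery keeps a transaction only if its marker landed. ccNVMe's contribution is obtaining this effect from the NVMe submission path itself, with two memory-mapped doorbell writes instead of extra journal block I/Os; the model below, with hypothetical names, only illustrates the recovery rule, not the driver or the hardware interface.

```python
class TxnDevice:
    """Toy model of crash-consistent writes: blocks become visible after
    recovery only if their transaction's commit marker was persisted."""
    def __init__(self):
        self.persisted_blocks = {}    # (txn_id, lba) -> data
        self.commit_markers = set()   # txn_ids whose commit record landed

    def submit(self, txn_id, writes, crash_before_commit=False):
        # Phase 1: disseminate the data blocks (in ccNVMe this is the
        # normal NVMe submission path, not a separate journal).
        for lba, data in writes:
            self.persisted_blocks[(txn_id, lba)] = data
        if crash_before_commit:
            return
        # Phase 2: one small commit marker makes the whole transaction
        # atomic (the analogue of the transaction-aware doorbell write).
        self.commit_markers.add(txn_id)

    def recover(self):
        return {lba: data
                for (txn, lba), data in self.persisted_blocks.items()
                if txn in self.commit_markers}

dev = TxnDevice()
dev.submit(1, [(0, b"A"), (1, b"B")])
dev.submit(2, [(2, b"C")], crash_before_commit=True)
print(dev.recover())   # {0: b'A', 1: b'B'} -- the torn transaction 2 is invisible
```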
Citations: 6
Coeus: A System for Oblivious Document Ranking and Retrieval
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483586
Ishtiyaque Ahmad, Laboni Sarker, D. Agrawal, A. E. Abbadi, Trinabh Gupta
Given a private string q and a remote server that holds a set of public documents D, how can one of the K most relevant documents to q in D be selected and viewed without anyone (not even the server) learning anything about q or the document? This is the oblivious document ranking and retrieval problem. In this paper, we describe Coeus, a system that solves this problem. At a high level, Coeus composes two cryptographic primitives: secure matrix-vector product for scoring document relevance using the widely-used term frequency-inverse document frequency (tf-idf) method, and private information retrieval (PIR) for obliviously retrieving documents. However, Coeus reduces the time to run these protocols, thereby improving the user-perceived latency, which is a key performance metric. Coeus first reduces the PIR overhead by separating out private metadata retrieval from document retrieval, and it then scales secure matrix-vector product to tf-idf matrices with several hundred billion elements through a series of novel cryptographic refinements. For a corpus of English Wikipedia containing 5 million documents, a keyword dictionary with 64K keywords, and on a cluster of 143 machines on AWS, Coeus enables a user to obliviously rank and retrieve a document in 3.9 seconds---a 24x improvement over a baseline system.
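The first primitive, relevance scoring as a matrix-vector product over tf-idf weights, is easy to show in the clear; Coeus performs the same linear algebra under encryption. The tiny corpus and variable names below are illustrative only.

```python
import numpy as np
from collections import Counter

docs = ["distributed systems are fun",
        "private information retrieval hides the query",
        "ranking documents with tf idf"]
vocab = sorted({w for d in docs for w in d.split()})
windex = {w: i for i, w in enumerate(vocab)}

# Build the |docs| x |vocab| tf-idf matrix held by the server.
tf = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        tf[i, windex[w]] = c / len(d.split())
df = np.count_nonzero(tf > 0, axis=0)          # document frequency per word
tfidf = tf * np.log(len(docs) / df)

# The private query becomes a vocab-sized vector; relevance scores for
# every document come from a single matrix-vector product.
query = np.zeros(len(vocab))
for w in "private retrieval".split():
    if w in windex:
        query[windex[w]] = 1.0
scores = tfidf @ query
print("best document:", int(np.argmax(scores)))
```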
Citations: 7
Understanding and Detecting Software Upgrade Failures in Distributed Systems
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483577
Yongle Zhang, Junwen Yang, Zhuqi Jin, Utsav Sethi, Kirk Rodrigues, Shan Lu, Ding Yuan
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics. This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework, DUPTester, that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers, DUPChecker, that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.
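The kind of cross-version data-format incompatibility DUPChecker flags can be illustrated with a small structural diff of two record layouts. The schemas and checker below are hypothetical stand-ins; the real tool works as static analysis over the systems' source code.

```python
def format_diff(old_schema: dict, new_schema: dict):
    """Compare two versions of a persisted record layout.

    A field the old version writes but the new version no longer reads
    (or reads with a different type) is a candidate upgrade failure:
    data written before the upgrade may not parse after it."""
    issues = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            issues.append(f"field '{field}' dropped in new version")
        elif new_schema[field] != old_type:
            issues.append(f"field '{field}' changed {old_type} -> {new_schema[field]}")
    for field in new_schema:
        if field not in old_schema:
            issues.append(f"field '{field}' added without a default for old data")
    return issues

# Hypothetical on-disk record layouts for two releases.
v1 = {"region_id": "int64", "timestamp": "int32", "payload": "bytes"}
v2 = {"region_id": "int64", "timestamp": "int64", "checksum": "int32", "payload": "bytes"}
for issue in format_diff(v1, v2):
    print(issue)
```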
Citations: 22
PRISM: Rethinking the RDMA Interface for Distributed Systems
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483587
Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, J. Nelson, Irene Zhang, Dan R. K. Ports
Remote Direct Memory Access (RDMA) has been used to accelerate a variety of distributed systems, by providing low-latency, CPU-bypassing access to a remote host's memory. However, most of the distributed protocols used in these systems cannot easily be expressed in terms of the simple memory READs and WRITEs provided by RDMA. As a result, designers face a choice between introducing additional protocol complexity (e.g., additional round trips) or forgoing the benefits of RDMA entirely. This paper argues that an extension to the RDMA interface can resolve this dilemma. We introduce the PRISM interface, which adds four new primitives: indirection, allocation, enhanced compare-and-swap, and operation chaining. These increase the expressivity of the RDMA interface, while still being implementable using the same underlying hardware features. We show their utility by designing three new applications using PRISM primitives that require little to no server-side CPU involvement: (1) PRISM-KV, a key-value store; (2) PRISM-RS, a replicated block store; and (3) PRISM-TX, a distributed transaction protocol. Using a software-based implementation of the PRISM primitives, we show that these systems outperform prior RDMA-based equivalents.
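The value of indirection and operation chaining is easiest to see against a simulated remote memory: a pointer chase that costs two round trips with plain READs collapses to one when the NIC follows the pointer remotely. The sketch below is a host-side model of that behaviour with names of our choosing; it is not the PRISM hardware interface.

```python
class RemoteMemory:
    """Simulated remote host memory: address -> value."""
    def __init__(self):
        self.mem = {}

    # Plain RDMA-style verbs: each call models one network round trip.
    def read(self, addr):
        return self.mem.get(addr)

    def write(self, addr, value):
        self.mem[addr] = value

    # Indirection in the PRISM spirit: the pointer is followed on the
    # remote side, so the chained access costs a single round trip.
    def read_indirect(self, addr_of_pointer):
        target = self.mem.get(addr_of_pointer)
        return self.mem.get(target)

remote = RemoteMemory()
remote.write(0x2000, b"record for key 42")   # the data block
remote.write(0x1000, 0x2000)                 # index slot holding a pointer

# Two round trips with plain READs:
ptr = remote.read(0x1000)
value_two_rtt = remote.read(ptr)

# One round trip with remote-side indirection:
value_one_rtt = remote.read_indirect(0x1000)
assert value_two_rtt == value_one_rtt
```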
Citations: 20
Automated SmartNIC Offloading Insights for Network Functions
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483583
Yiming Qiu, Jiarong Xing, Kuo-Feng Hsu, Qiao Kang, Ming G. Liu, S. Narayana, Ang Chen
The gap between CPU and networking speeds has motivated the development of SmartNICs for NF (network functions) offloading. However, offloading performance is predicated upon intricate knowledge about SmartNIC hardware and careful hand-tuning of the ported programs. Today, developers cannot easily reason about the offloading performance or the effectiveness of different porting strategies without resorting to a trial-and-error approach. Clara is an automated tool that improves the productivity of this workflow by generating offloading insights. Our tool can a) analyze a legacy NF in its unported form, predicting its performance characteristics on a SmartNIC (e.g., compute vs. memory intensity); and b) explore and suggest porting strategies for the given NF to achieve higher performance. To achieve these goals, Clara uses program analysis techniques to extract NF features, and combines them with machine learning techniques to handle opaque SmartNIC details. Our evaluation using Click NF programs on a Netronome Smart-NIC shows that Clara achieves high accuracy in its analysis, and that its suggested porting strategies lead to significant performance improvements.
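Clara's overall recipe, static features of the unported NF feeding a learned performance model, can be sketched with a toy feature extractor and a least-squares fit. The features, training data, and model below are placeholders of our own, not Clara's actual analysis.

```python
import numpy as np

def extract_features(nf_source: str):
    """Crude static features of a network function: counts of memory
    accesses, table lookups, and loop constructs in its source text."""
    return [nf_source.count("load") + nf_source.count("store"),
            nf_source.count("lookup"),
            nf_source.count("for") + nf_source.count("while")]

# Hypothetical training set: features of already-ported NFs paired with
# their measured SmartNIC throughput (Mpps).
X = np.array([[10, 2, 1], [40, 8, 3], [25, 4, 2], [60, 12, 5]], dtype=float)
y = np.array([18.0, 6.5, 11.0, 4.2])

# Fit a linear performance model: throughput ~ w . features + b.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict for an NF that has not been ported yet.
new_nf = "for pkt in batch: lookup(flow_table); store(counter)"
f = np.array(extract_features(new_nf) + [1.0])
print(f"predicted throughput: {f @ coef:.1f} Mpps")
```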
Citations: 20
Random Walks on Huge Graphs at Cache Efficiency
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483575
Ke Yang, Xiaosong Ma, Saravanan Thirumuruganathan, Kang Chen, Yongwei Wu
Data-intensive applications dominated by random accesses to large working sets fail to utilize the computing power of modern processors. Graph random walk, an indispensable workhorse for many important graph processing and learning applications, is one prominent case of such applications. Existing graph random walk systems are currently unable to match the GPU-side node embedding training speed. This work reveals that existing approaches fail to effectively utilize the modern CPU memory hierarchy, due to the widely held assumption that the inherent randomness in random walks and the skewed nature of graphs render most memory accesses random. We demonstrate that there is actually plenty of spatial and temporal locality to harvest, by careful partitioning, rearranging, and batching of operations. The resulting system, FlashMob, improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. We also found that a classical combinatorial optimization problem (and its exact pseudo-polynomial solution) can be applied to complex decision making, for accurate yet efficient data/task partitioning. Our comprehensive experiments over diverse graphs show that our system achieves an order of magnitude performance improvement over the fastest existing system. It processes a 58GB real graph at higher per-step speed than the existing system on a 600KB toy graph fitting in the L2 cache.
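The locality being harvested can be made concrete: rather than running each walker to completion (random accesses across the whole graph), all walkers advance one step at a time, grouped by the partition their current vertex lives in, so each batch touches one cache-sized slice of adjacency data. The sketch below shows only that traversal order, with a trivial partitioner of our choosing; it is not FlashMob's implementation.

```python
import random
from collections import defaultdict

def batched_random_walks(adj, starts, steps, num_partitions=4):
    """Advance all walkers step by step, processing them partition by
    partition so each batch reuses one slice of the adjacency lists."""
    positions = list(starts)
    part = lambda v: v % num_partitions        # toy vertex partitioner
    for _ in range(steps):
        buckets = defaultdict(list)
        for walker, v in enumerate(positions):
            buckets[part(v)].append(walker)
        for p in sorted(buckets):              # one partition's walkers at a time
            for walker in buckets[p]:
                v = positions[walker]
                positions[walker] = random.choice(adj[v]) if adj[v] else v
    return positions

# Tiny example graph (adjacency lists) and 8 walkers.
adj = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [1], 4: [5], 5: [4]}
print(batched_random_walks(adj, starts=[0, 1, 2, 3, 4, 5, 0, 1], steps=10))
```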
Citations: 11
The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483569
Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, J. Nelson, O. N. Leija, Ashlie Martinez, Jing Liu, A. K. Simpson, Sujay Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury, Anirudh Badam
Datacenter systems and I/O devices now run at single-digit microsecond latencies, requiring ns-scale operating systems. Traditional kernel-based operating systems impose an unaffordable overhead, so recent kernel-bypass OSes [73] and libraries [23] eliminate the OS kernel from the I/O datapath. However, none of these systems offer a general-purpose datapath OS replacement that meets the needs of μs-scale systems. This paper proposes Demikernel, a flexible datapath OS and architecture designed for heterogenous kernel-bypass devices and μs-scale datacenter systems. We build two prototype Demikernel OSes and show that minimal effort is needed to port existing μs-scale systems. Once ported, Demikernel lets applications run across heterogenous kernel-bypass devices with ns-scale overheads and no code changes.
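The portability claim rests on hiding heterogenous kernel-bypass backends behind a single datapath interface. The sketch below shows what such a uniform, queue-style interface could look like; the class and method names are hypothetical and do not reflect Demikernel's actual API.

```python
from abc import ABC, abstractmethod
from typing import Optional

class DatapathQueue(ABC):
    """One I/O queue exposed by a datapath OS; an application written
    against this interface runs unchanged on any backend."""
    @abstractmethod
    def push(self, buf: bytes) -> None: ...
    @abstractmethod
    def pop(self) -> Optional[bytes]: ...

class LoopbackQueue(DatapathQueue):
    """Stand-in backend (in-process loopback). A kernel-bypass backend,
    e.g. one wrapping a user-level NIC driver, would implement the same
    two methods."""
    def __init__(self):
        self._buf = []
    def push(self, buf: bytes) -> None:
        self._buf.append(buf)
    def pop(self) -> Optional[bytes]:
        return self._buf.pop(0) if self._buf else None

def echo_once(q: DatapathQueue) -> Optional[bytes]:
    """A tiny 'application': backend-agnostic because it only sees the queue."""
    msg = q.pop()
    if msg is not None:
        q.push(msg)
    return msg

q = LoopbackQueue()
q.push(b"ping")
print(echo_once(q))
```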
Citations: 49
Birds of a Feather Flock Together: Scaling RDMA RPCs with Flock
Q3 Computer Science | Pub Date: 2021-10-26 | DOI: 10.1145/3477132.3483576
S. Monga, Sanidhya Kashyap, Changwoo Min
RDMA-capable networks are gaining traction with datacenter deployments due to their high throughput, low latency, CPU efficiency, and advanced features, such as remote memory operations. However, efficiently utilizing RDMA capability in a common setting of high fan-in, fan-out asymmetric network topology is challenging. For instance, using RDMA programming features comes at the cost of connection scalability, which does not scale with increasing cluster size. To address that, several works forgo some RDMA features by only focusing on conventional RPC APIs. In this work, we strive to exploit the full capability of RDMA, while scaling the number of connections regardless of the cluster size. We present Flock, a communication framework for RDMA networks that uses hardware provided reliable connection. Using a partially shared model, Flock departs from the conventional RDMA design by enabling connection sharing among threads, which provides significant performance improvements contrary to the widely held belief that connection sharing deteriorates performance. At its core, Flock uses a connection handle abstraction for connection multiplexing; a new coalescing-based synchronization approach for efficient network utilization; and a load-control mechanism for connections with symbiotic send-recv scheduling, which reduces the synchronization overheads associated with connection sharing along with ensuring fair utilization of network connections. We demonstrate the benefits for a distributed transaction processing system and an in-memory index, where it outperforms other RPC systems by up to 88% and 50%, respectively, with significant reductions in median and tail latency.
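The partially shared model can be pictured as a small pool of connections multiplexed behind per-thread handles, rather than one connection per thread (which grows with cluster size) or a single global connection (contention). The sketch below simulates that structure with locks standing in for RDMA queue pairs; it is not the Flock implementation and omits its coalescing and symbiotic send-recv scheduling.

```python
import threading

class Connection:
    """Stand-in for a reliable RDMA connection (queue pair)."""
    def __init__(self, cid):
        self.cid = cid
        self.lock = threading.Lock()
        self.sent = 0

    def send(self, msg):
        with self.lock:          # serialize threads sharing this connection
            self.sent += 1

class ConnectionHandle:
    """Per-thread handle multiplexing a small shared pool, so the number
    of connections stays fixed as the number of threads grows."""
    def __init__(self, pool, thread_id):
        self.conn = pool[thread_id % len(pool)]

    def send(self, msg):
        self.conn.send(msg)

pool = [Connection(i) for i in range(2)]      # 2 connections shared by 8 threads

def worker(tid):
    handle = ConnectionHandle(pool, tid)
    for _ in range(1000):
        handle.send(b"rpc")

threads = [threading.Thread(target=worker, args=(t,)) for t in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print([c.sent for c in pool])                 # [4000, 4000]
```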
Citations: 16