
Operating Systems Review (ACM): Latest Articles

PACTree: A High Performance Persistent Range Index Using PAC Guidelines
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483589
Wook-Hee Kim, Madhava Krishnan Ramanathan, Xinwei Fu, Sanidhya Kashyap, Changwoo Min
Non-Volatile Memory (NVM), which provides relatively fast and byte-addressable persistence, is now commercially available. However, we cannot equate a real NVM with a slow DRAM, as it is much more complicated than we expect. In this work, we revisit and analyze both NVM and NVM-specific persistent memory indexes. We find that there is still a lot of room for improvement if we consider NVM hardware, its software stack, persistent index design, and concurrency control. Based on our analysis, we propose Packed Asynchronous Concurrency (PAC) guidelines for designing high-performance persistent index structures. The key idea behind the guidelines is to 1) access NVM hardware in a packed manner to minimize its bandwidth utilization and 2) exploit asynchronous concurrency control to decouple the long NVM latency from the critical path of the index. We develop PACTree, a high-performance persistent range index following the PAC guidelines. PACTree is a hybrid index that employs a trie index for its internal nodes and B+-tree-like leaf nodes. The trie index structure packs partial keys in internal nodes. Moreover, we decouple the trie index and B+-tree-like leaf nodes. The decoupling allows us to prevent blocking concurrent accesses by updating internal nodes asynchronously. Our evaluation shows that PACTree outperforms state-of-the-art persistent range indexes by 7x in performance and 20x in 99.99 percentile tail latency.
Citations: 34
Snowboard
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483549
Sishuai Gong, Deniz Altinbüken, P. Fonseca, Petros Maniatis
Kernel concurrency bugs are challenging to find because they depend on very specific thread interleavings and test inputs. While separately exploring kernel thread interleavings or test inputs has been closely examined, jointly exploring interleavings and test inputs has received little attention, in part due to the resulting vast search space. Using precious, limited testing resources to explore this search space and execute just the right concurrent tests in the proper order is critical. This paper proposes Snowboard, a testing framework that generates and executes concurrent tests by intelligently exploring thread interleavings and test inputs jointly. The design of Snowboard is based on a concept called potential memory communication (PMC), a guess about pairs of tests that, when executed concurrently, are likely to perform memory accesses to shared addresses, which in turn may trigger concurrency bugs. To identify PMCs, Snowboard runs tests sequentially from a fixed initial kernel state, collecting their memory accesses. It then pairs up tests that write and read the same region into candidate concurrent tests. It executes those tests using the associated PMC as a scheduling hint to focus interleaving search only on those schedules that directly affect the relevant memory accesses. By clustering candidate tests on various features of their PMCs, Snowboard avoids testing similar behaviors, which would be inefficient. Finally, by executing tests from small clusters first, it prioritizes uncommon suspicious behaviors that may have received less scrutiny. Snowboard discovered 14 new concurrency bugs in Linux kernels 5.3.10 and 5.12-rc3, of which 12 have been confirmed by developers. Six of these bugs cause kernel panics and filesystem errors, and at least two have existed in the kernel for many years, showing that this approach can uncover hard-to-find, critical bugs. Furthermore, we show that covering as many distinct pairs of uncommon read/write instructions as possible is the test-prioritization strategy with the highest bug yield for a given test-time budget.
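The pairing and prioritization steps described in this abstract can be illustrated with a small sketch. It is not Snowboard's implementation: the profile format, the function names `find_pmcs` and `prioritize`, the choice to skip self-pairs, and the use of the shared address as the only clustering feature are all assumptions made for illustration.

```python
from collections import defaultdict


def find_pmcs(profiles):
    """Pair tests that write and read the same address.

    profiles: dict mapping a test name to a list of (kind, address)
    tuples recorded during a sequential run, where kind is "R" or "W".
    Returns a list of (writer_test, reader_test, address) candidates.
    """
    writers = defaultdict(set)
    readers = defaultdict(set)
    for test, accesses in profiles.items():
        for kind, addr in accesses:
            (writers if kind == "W" else readers)[addr].add(test)

    pmcs = []
    # Only addresses that are both written and read can form a PMC.
    for addr in sorted(writers.keys() & readers.keys()):
        for w in sorted(writers[addr]):
            for r in sorted(readers[addr]):
                if w != r:  # assumption: do not pair a test with itself
                    pmcs.append((w, r, addr))
    return pmcs


def prioritize(pmcs):
    """Order candidates so that members of small clusters run first.

    Assumption: candidates are clustered by shared address only.
    """
    clusters = defaultdict(list)
    for pmc in pmcs:
        clusters[pmc[2]].append(pmc)
    ordered = sorted(clusters.items(), key=lambda kv: (len(kv[1]), kv[0]))
    return [pmc for _, group in ordered for pmc in group]
```

In this sketch, a candidate whose address is shared by few test pairs is scheduled before candidates from crowded clusters, mirroring the idea of testing uncommon behaviors first.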
Citations: 1
ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483542
J. Humphries, Neel Natu, Ashwin Chaugule, Ofir Weisse, Barret Rhoden, Josh Don, L. Rizzo, Oleg Rombakh, Paul Turner, C. Kozyrakis
We present ghOSt, our infrastructure for delegating kernel scheduling decisions to userspace code. ghOSt is designed to support the rapidly evolving needs of our data center workloads and platforms. Improving scheduling decisions can drastically improve the throughput, tail latency, scalability, and security of important workloads. However, kernel schedulers are difficult to implement, test, and deploy efficiently across a large fleet. Recent research suggests bespoke scheduling policies, within custom data plane operating systems, can provide compelling performance results in a data center setting. However, these gains have proved difficult to realize as it is impractical to deploy a custom OS image(s) at an application granularity, particularly in a multi-tenant environment, limiting the practical applications of these new techniques. ghOSt provides general-purpose delegation of scheduling policies to userspace processes in a Linux environment. ghOSt provides state encapsulation, communication, and action mechanisms that allow complex expression of scheduling policies within a userspace agent, while assisting in synchronization. Programmers use any language to develop and optimize policies, which are modified without a host reboot. ghOSt supports a wide range of scheduling models, from per-CPU to centralized, run-to-completion to preemptive, and incurs low overheads for scheduling actions. We demonstrate ghOSt's performance on both academic and real-world workloads, including Google Snap and Google Search. We show that by using ghOSt instead of the kernel scheduler, we can quickly achieve comparable throughput and latency while enabling policy optimization, non-disruptive upgrades, and fault isolation for our data center workloads. We open-source our implementation to enable future research and development based on ghOSt.
Citations: 28
Geometric Partitioning: Explore the Boundary of Optimal Erasure Code Repair
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483558
Yingdi Shan, Kang Chen, Tuoyu Gong, Lidong Zhou, Tai Zhou, Yongwei Wu
Erasure coding is widely used in building reliable distributed object storage systems despite its high repair cost. Regenerating codes are a special class of erasure codes, which are proposed to minimize the amount of data needed for repair. In this paper, we assess how optimal repair can help to improve object storage systems, and we find that regenerating codes present unique challenges: regenerating codes repair at the granularity of chunks instead of bytes, and the choice of chunk size leads to the tension between streamed degraded read time and repair throughput. To address this dilemma, we propose Geometric Partitioning, which partitions each object into a series of chunks with their sizes in a geometric sequence to obtain the benefits of both large and small chunk sizes. Geometric Partitioning helps regenerating codes to achieve 1.85x recovery performance of RS code while keeping degraded read time low.
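As a rough illustration of partitioning an object into chunks whose sizes follow a geometric sequence, consider the sketch below. The abstract does not give the ratio, the first chunk size, or how the tail of the object is handled, so the function name `geometric_chunks`, its parameters, the optional cap, and the truncated final chunk are assumptions rather than the paper's scheme.

```python
def geometric_chunks(object_size, first_chunk, ratio=2, max_chunk=None):
    """Return chunk sizes (in bytes) that grow geometrically.

    object_size: total number of bytes to partition.
    first_chunk: size of the first (smallest) chunk.
    ratio:       growth factor between consecutive chunks (assumed).
    max_chunk:   optional upper bound on chunk size (assumed).
    The last chunk is truncated so the sizes sum to object_size.
    """
    if object_size < 0 or first_chunk <= 0 or ratio < 1:
        raise ValueError("invalid partitioning parameters")

    sizes = []
    remaining = object_size
    size = first_chunk if max_chunk is None else min(first_chunk, max_chunk)
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= ratio
        if max_chunk is not None:
            size = min(size, max_chunk)
    return sizes
```

Small leading chunks let a streamed degraded read start returning data early, while the larger later chunks are the ones that would favor repair throughput, which is the trade-off the abstract describes.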
Citations: 7
LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483565
Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, M. Canini, Dejan Kostic, Youngjin Kwon, Simon Peter, E. Witchel
In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management, to a SmartNIC. We implement LineFS on the Mellanox BlueField SmartNIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB by up to 80% and throughput in Filebench by up to 79%, while providing extended DFS availability during host system failures.
Citations: 47
CLoF
Q3 Computer Science Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483557
Rafael Lourenco de Lima Chehab, Antonio Paolillo, Diogo Behrens, M. Fu, Hermann Härtig, Haibo Chen
Efficient locking mechanisms are extremely important to support large-scale concurrency and exploit the performance promises of many-core servers. Implementing an efficient, generic, and correct lock is very challenging due to the differences between various NUMA architectures. The performance impact of architectural/NUMA hierarchy differences between x86 and Armv8 is not yet fully explored, leading to unexpected performance when simply porting NUMA-aware locks from x86 to Armv8. Moreover, due to the Armv8 Weak Memory Model (WMM), correctly implementing complicated NUMA-aware locks is very difficult. We propose a Compositional Lock Framework (CLoF) for multi-level NUMA systems. CLoF composes NUMA-oblivious locks in a hierarchy matching the target platform, leading to hundreds of correct-by-construction NUMA-aware locks. CLoF can automatically select the best lock among them. To show the correctness of CLoF on WMMs, we provide an inductive argument with base and induction steps verified with model checkers. In our evaluation, CLoF locks outperform state-of-the-art NUMA-aware locks in most scenarios, e.g., in a highly contended LevelDB benchmark, our best CLoF locks yield twice the throughput achieved with CNA lock and ShflLock on large x86 and Armv8 servers.
Citations: 0
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP
Q3 Computer Science Pub Date : 2021-10-22 DOI: 10.1145/3477132.3483588
D. Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, M. Zaharia
Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.
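The split-and-coalesce idea in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `pop_allocate` and `round_robin_solver` are hypothetical names, the toy solver stands in for whatever optimization formulation the original problem uses, and the sub-allocations are assumed to be dictionaries that can simply be merged.

```python
import random


def pop_allocate(clients, resources, k, solve, seed=0):
    """Randomly split a problem into k subproblems and merge the results.

    clients, resources: sequences of hashable identifiers.
    k:     number of subproblems.
    solve: callable(sub_clients, sub_resources) -> dict client -> resource;
           in practice this would be the original solver, unchanged.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    clients = list(clients)
    resources = list(resources)
    rng.shuffle(clients)
    rng.shuffle(resources)

    allocation = {}
    for i in range(k):
        # Every k-th element after a shuffle gives a random, disjoint subset.
        sub_clients = clients[i::k]
        sub_resources = resources[i::k]
        allocation.update(solve(sub_clients, sub_resources))
    return allocation


def round_robin_solver(sub_clients, sub_resources):
    """Toy stand-in solver (assumption): spread clients over resources."""
    if not sub_resources:
        return {c: None for c in sub_clients}
    n = len(sub_resources)
    return {c: sub_resources[j % n] for j, c in enumerate(sub_clients)}
```

Because each subproblem sees only a fraction of the clients and resources, a solver whose cost grows faster than linearly with problem size would do less total work across the k subproblems than on the full problem.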
Citations: 27
Rabia: Simplifying State-Machine Replication Through Randomization
Q3 Computer Science Pub Date : 2021-09-26 DOI: 10.1145/3477132.3483582
Haochen Pan, Jesse Tuglu, Neo Zhou, Tianshu Wang, Yicheng Shen, Xiong Zheng, Joseph Tassarotti, Lewis Tseng, R. Palmieri
We introduce Rabia, a simple and high-performance framework for implementing state-machine replication (SMR) within a datacenter. The main innovation of Rabia is in using randomization to simplify the design. Rabia provides the following two features: (i) It does not need any fail-over protocol and supports trivial auxiliary protocols like log compaction, snapshotting, and reconfiguration, components that are often considered the most challenging when developing SMR systems; and (ii) It provides high performance, up to 1.5x higher throughput than the closest competitor (i.e., EPaxos) in a favorable setup (same availability zone with three replicas) and is comparable with a larger number of replicas or when deployed in multiple availability zones.
Citations: 12
Basil: Breaking up BFT with ACID (transactions)
Q3 Computer Science Pub Date : 2021-09-25 DOI: 10.1145/3477132.3483552
Florian Suri-Payer, Matthew Burke, Zheng Wang, Yunhao Zhang, Lorenzo Alvisi, Natacha Crooks
This paper presents Basil, the first transactional, leaderless Byzantine Fault Tolerant key-value store. Basil leverages ACID transactions to scalably implement the abstraction of a trusted shared log in the presence of Byzantine actors. Unlike traditional BFT approaches, Basil executes non-conflicting operations in parallel and commits transactions in a single round-trip during fault-free executions. Basil improves throughput over traditional BFT systems by four to five times, and is only four times slower than TAPIR, a non-Byzantine replicated system. Basil's novel recovery mechanism further minimizes the impact of failures: with 30% Byzantine clients, throughput drops by less than 25% in the worst case.
Citations: 55
Regular Sequential Serializability and Regular Sequential Consistency
Q3 Computer Science Pub Date : 2021-09-18 DOI: 10.1145/3477132.3483566
Jeffrey Helt, Matthew Burke, A. Levy, Wyatt Lloyd
Strictly serializable (linearizable) services appear to execute transactions (operations) sequentially, in an order consistent with real time. This restricts a transaction's (operation's) possible return values and in turn, simplifies application programming. In exchange, strictly serializable (linearizable) services perform worse than those with weaker consistency. But switching to such services can break applications. This work introduces two new consistency models to ease this trade-off: regular sequential serializability (RSS) and regular sequential consistency (RSC). They are just as strong for applications: we prove any application invariant that holds when using a strictly serializable (linearizable) service also holds when using an RSS (RSC) service. Yet they relax the constraints on services: they allow new, better-performing designs. To demonstrate this, we design, implement, and evaluate variants of two systems, Spanner and Gryff, relaxing their consistency to RSS and RSC, respectively. The new variants achieve better read-only transaction and read tail latency than their counterparts.
Citations: 4