Operating Systems Review (ACM)最新文献

英文中文

Snowboard 滑雪板

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483549

Sishuai Gong, Deniz Altinbüken, P. Fonseca, Petros Maniatis

Kernel concurrency bugs are challenging to find because they depend on very specific thread interleavings and test inputs. While separately exploring kernel thread interleavings or test inputs has been closely examined, jointly exploring interleavings and test inputs has received little attention, in part due to the resulting vast search space. Using precious, limited testing resources to explore this search space and execute just the right concurrent tests in the proper order is critical. This paper proposes Snowboard a testing framework that generates and executes concurrent tests by intelligently exploring thread interleavings and test inputs jointly. The design of Snowboard is based on a concept called potential memory communication (PMC), a guess about pairs of tests that, when executed concurrently, are likely to perform memory accesses to shared addresses, which in turn may trigger concurrency bugs. To identify PMCs, Snowboard runs tests sequentially from a fixed initial kernel state, collecting their memory accesses. It then pairs up tests that write and read the same region into candidate concurrent tests. It executes those tests using the associated PMC as a scheduling hint to focus interleaving search only on those schedules that directly affect the relevant memory accesses. By clustering candidate tests on various features of their PMCs, Snowboard avoids testing similar behaviors, which would be inefficient. Finally, by executing tests from small clusters first, it prioritizes uncommon suspicious behaviors that may have received less scrutiny. Snowboard discovered 14 new concurrency bugs in Linux kernels 5.3.10 and 5.12-rc3, of which 12 have been confirmed by developers. Six of these bugs cause kernel panics and filesystem errors, and at least two have existed in the kernel for many years, showing that this approach can uncover hard-to-find, critical bugs. Furthermore, we show that covering as many distinct pairs of uncommon read/write instructions as possible is the test-prioritization strategy with the highest bug yield for a given test-time budget.

{"title":"Snowboard","authors":"Sishuai Gong, Deniz Altinbüken, P. Fonseca, Petros Maniatis","doi":"10.1145/3477132.3483549","DOIUrl":"https://doi.org/10.1145/3477132.3483549","url":null,"abstract":"Kernel concurrency bugs are challenging to find because they depend on very specific thread interleavings and test inputs. While separately exploring kernel thread interleavings or test inputs has been closely examined, jointly exploring interleavings and test inputs has received little attention, in part due to the resulting vast search space. Using precious, limited testing resources to explore this search space and execute just the right concurrent tests in the proper order is critical. This paper proposes Snowboard a testing framework that generates and executes concurrent tests by intelligently exploring thread interleavings and test inputs jointly. The design of Snowboard is based on a concept called potential memory communication (PMC), a guess about pairs of tests that, when executed concurrently, are likely to perform memory accesses to shared addresses, which in turn may trigger concurrency bugs. To identify PMCs, Snowboard runs tests sequentially from a fixed initial kernel state, collecting their memory accesses. It then pairs up tests that write and read the same region into candidate concurrent tests. It executes those tests using the associated PMC as a scheduling hint to focus interleaving search only on those schedules that directly affect the relevant memory accesses. By clustering candidate tests on various features of their PMCs, Snowboard avoids testing similar behaviors, which would be inefficient. Finally, by executing tests from small clusters first, it prioritizes uncommon suspicious behaviors that may have received less scrutiny. Snowboard discovered 14 new concurrency bugs in Linux kernels 5.3.10 and 5.12-rc3, of which 12 have been confirmed by developers. Six of these bugs cause kernel panics and filesystem errors, and at least two have existed in the kernel for many years, showing that this approach can uncover hard-to-find, critical bugs. Furthermore, we show that covering as many distinct pairs of uncommon read/write instructions as possible is the test-prioritization strategy with the highest bug yield for a given test-time budget.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76816619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Automated SmartNIC Offloading Insights for Network Functions 针对网络功能的自动SmartNIC卸载见解

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483583

Yiming Qiu, Jiarong Xing, Kuo-Feng Hsu, Qiao Kang, Ming G. Liu, S. Narayana, Ang Chen

The gap between CPU and networking speeds has motivated the development of SmartNICs for NF (network functions) offloading. However, offloading performance is predicated upon intricate knowledge about SmartNIC hardware and careful hand-tuning of the ported programs. Today, developers cannot easily reason about the offloading performance or the effectiveness of different porting strategies without resorting to a trial-and-error approach. Clara is an automated tool that improves the productivity of this workflow by generating offloading insights. Our tool can a) analyze a legacy NF in its unported form, predicting its performance characteristics on a SmartNIC (e.g., compute vs. memory intensity); and b) explore and suggest porting strategies for the given NF to achieve higher performance. To achieve these goals, Clara uses program analysis techniques to extract NF features, and combines them with machine learning techniques to handle opaque SmartNIC details. Our evaluation using Click NF programs on a Netronome Smart-NIC shows that Clara achieves high accuracy in its analysis, and that its suggested porting strategies lead to significant performance improvements.

CPU速度和网络速度之间的差距推动了smartnic的发展，用于NF(网络功能)卸载。但是，卸载性能取决于对SmartNIC硬件的复杂了解和对移植程序的仔细手动调优。今天，开发人员不能轻易地推断出卸载性能或不同移植策略的有效性，而不诉诸于试错方法。Clara是一个自动化工具，它通过生成卸载洞察来提高工作流的生产率。我们的工具可以a)分析未移植形式的遗留NF，预测其在SmartNIC上的性能特征(例如，计算强度与内存强度);b)探索并建议给定NF的移植策略，以实现更高的性能。为了实现这些目标，Clara使用程序分析技术提取NF特征，并将其与机器学习技术相结合来处理不透明的SmartNIC细节。我们在Netronome Smart-NIC上使用Click NF程序进行的评估表明，Clara在分析中达到了很高的准确性，并且其建议的移植策略导致了显著的性能改进。

{"title":"Automated SmartNIC Offloading Insights for Network Functions","authors":"Yiming Qiu, Jiarong Xing, Kuo-Feng Hsu, Qiao Kang, Ming G. Liu, S. Narayana, Ang Chen","doi":"10.1145/3477132.3483583","DOIUrl":"https://doi.org/10.1145/3477132.3483583","url":null,"abstract":"The gap between CPU and networking speeds has motivated the development of SmartNICs for NF (network functions) offloading. However, offloading performance is predicated upon intricate knowledge about SmartNIC hardware and careful hand-tuning of the ported programs. Today, developers cannot easily reason about the offloading performance or the effectiveness of different porting strategies without resorting to a trial-and-error approach. Clara is an automated tool that improves the productivity of this workflow by generating offloading insights. Our tool can a) analyze a legacy NF in its unported form, predicting its performance characteristics on a SmartNIC (e.g., compute vs. memory intensity); and b) explore and suggest porting strategies for the given NF to achieve higher performance. To achieve these goals, Clara uses program analysis techniques to extract NF features, and combines them with machine learning techniques to handle opaque SmartNIC details. Our evaluation using Click NF programs on a Netronome Smart-NIC shows that Clara achieves high accuracy in its analysis, and that its suggested porting strategies lead to significant performance improvements.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75782784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

Random Walks on Huge Graphs at Cache Efficiency 基于缓存效率的大图随机漫步

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483575

Ke Yang, Xiaosong Ma, Saravanan Thirumuruganathan, Kang Chen, Yongwei Wu

Data-intensive applications dominated by random accesses to large working sets fail to utilize the computing power of modern processors. Graph random walk, an indispensable workhorse for many important graph processing and learning applications, is one prominent case of such applications. Existing graph random walk systems are currently unable to match the GPU-side node embedding training speed. This work reveals that existing approaches fail to effectively utilize the modern CPU memory hierarchy, due to the widely held assumption that the inherent randomness in random walks and the skewed nature of graphs render most memory accesses random. We demonstrate that there is actually plenty of spatial and temporal locality to harvest, by careful partitioning, rearranging, and batching of operations. The resulting system, FlashMob, improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. We also found that a classical combinatorial optimization problem (and its exact pseudo-polynomial solution) can be applied to complex decision making, for accurate yet efficient data/task partitioning. Our comprehensive experiments over diverse graphs show that our system achieves an order of magnitude performance improvement over the fastest existing system. It processes a 58GB real graph at higher per-step speed than the existing system on a 600KB toy graph fitting in the L2 cache.

以随机访问大型工作集为主的数据密集型应用程序无法利用现代处理器的计算能力。图随机游走是许多重要的图处理和学习应用中不可或缺的主力，是这类应用的一个突出例子。现有的图随机漫步系统目前无法匹配gpu侧节点嵌入的训练速度。这项工作揭示了现有的方法不能有效地利用现代CPU内存层次结构，因为人们普遍认为随机漫步的固有随机性和图的倾斜性质使得大多数内存访问是随机的。我们证明，通过仔细划分、重新安排和批处理操作，实际上有大量的空间和时间局部性可以获取。由此产生的系统FlashMob通过使内存访问更加有序和有规律，提高了缓存和内存带宽的利用率。我们还发现，一个经典的组合优化问题(及其精确的伪多项式解)可以应用于复杂的决策制定，以实现准确而有效的数据/任务划分。我们在不同图形上的综合实验表明，我们的系统比现有最快的系统实现了一个数量级的性能改进。它处理58GB的真实图形的每步速度比现有系统处理600KB的玩具图形的速度快。

{"title":"Random Walks on Huge Graphs at Cache Efficiency","authors":"Ke Yang, Xiaosong Ma, Saravanan Thirumuruganathan, Kang Chen, Yongwei Wu","doi":"10.1145/3477132.3483575","DOIUrl":"https://doi.org/10.1145/3477132.3483575","url":null,"abstract":"Data-intensive applications dominated by random accesses to large working sets fail to utilize the computing power of modern processors. Graph random walk, an indispensable workhorse for many important graph processing and learning applications, is one prominent case of such applications. Existing graph random walk systems are currently unable to match the GPU-side node embedding training speed. This work reveals that existing approaches fail to effectively utilize the modern CPU memory hierarchy, due to the widely held assumption that the inherent randomness in random walks and the skewed nature of graphs render most memory accesses random. We demonstrate that there is actually plenty of spatial and temporal locality to harvest, by careful partitioning, rearranging, and batching of operations. The resulting system, FlashMob, improves both cache and memory bandwidth utilization by making memory accesses more sequential and regular. We also found that a classical combinatorial optimization problem (and its exact pseudo-polynomial solution) can be applied to complex decision making, for accurate yet efficient data/task partitioning. Our comprehensive experiments over diverse graphs show that our system achieves an order of magnitude performance improvement over the fastest existing system. It processes a 58GB real graph at higher per-step speed than the existing system on a 600KB toy graph fitting in the L2 cache.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"125 1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83306004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 11

PACTree: A High Performance Persistent Range Index Using PAC Guidelines PACTree:使用PAC指南的高性能持久范围索引

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483589

Wook-Hee Kim, Madhava Krishnan Ramanathan, Xinwei Fu, Sanidhya Kashyap, Changwoo Min

Non-Volatile Memory (NVM), which provides relatively fast and byte-addressable persistence, is now commercially available. However, we cannot equate a real NVM with a slow DRAM, as it is much more complicated than we expect. In this work, we revisit and analyze both NVM and NVM-specific persistent memory indexes. We find that there is still a lot of room for improvement if we consider NVM hardware, its software stack, persistent index design, and concurrency control. Based on our analysis, we propose Packed Asynchronous Concurrency (PAC) guidelines for designing high-performance persistent index structures. The key idea behind the guidelines is to 1) access NVM hardware in a packed manner to minimize its bandwidth utilization and 2) exploit asynchronous concurrency control to decouple the long NVM latency from the critical path of the index. We develop PACTree, a high-performance persistent range index following the PAC guidelines. PACTree is a hybrid index that employs a trie index for its internal nodes and B+-tree-like leaf nodes. The trie index structure packs partial keys in internal nodes. Moreover, we decouple the trie index and B+-tree-like leaf nodes. The decoupling allows us to prevent blocking concurrent accesses by updating internal nodes asynchronously. Our evaluation shows that PACTree outperforms state-of-the-art persistent range indexes by 7x in performance and 20x in 99.99 percentile tail latency.

非易失性内存(Non-Volatile Memory, NVM)提供了相对较快的、可字节寻址的持久性，现在已经商业化了。但是，我们不能将真正的NVM等同于慢速DRAM，因为它比我们预期的要复杂得多。在这项工作中，我们重新审视并分析了NVM和NVM特定的持久性内存索引。我们发现，如果考虑NVM硬件、软件堆栈、持久索引设计和并发控制，还有很大的改进空间。根据我们的分析，我们提出了用于设计高性能持久索引结构的打包异步并发(PAC)指南。指导方针背后的关键思想是:1)以打包的方式访问NVM硬件，以最小化其带宽利用率;2)利用异步并发控制，将NVM的长延迟与索引的关键路径解耦。我们开发了PACTree，这是一个遵循PAC指南的高性能持久范围索引。PACTree是一种混合型索引，它对内部节点和B+树状叶节点采用了tree索引。trie索引结构将部分键包在内部节点中。此外，我们解耦了tree索引和B+树状叶节点。解耦允许我们通过异步更新内部节点来防止阻塞并发访问。我们的评估表明，PACTree在性能上比最先进的持久范围索引高出7倍，在99.99%的尾延迟上高出20倍。

{"title":"PACTree: A High Performance Persistent Range Index Using PAC Guidelines","authors":"Wook-Hee Kim, Madhava Krishnan Ramanathan, Xinwei Fu, Sanidhya Kashyap, Changwoo Min","doi":"10.1145/3477132.3483589","DOIUrl":"https://doi.org/10.1145/3477132.3483589","url":null,"abstract":"Non-Volatile Memory (NVM), which provides relatively fast and byte-addressable persistence, is now commercially available. However, we cannot equate a real NVM with a slow DRAM, as it is much more complicated than we expect. In this work, we revisit and analyze both NVM and NVM-specific persistent memory indexes. We find that there is still a lot of room for improvement if we consider NVM hardware, its software stack, persistent index design, and concurrency control. Based on our analysis, we propose Packed Asynchronous Concurrency (PAC) guidelines for designing high-performance persistent index structures. The key idea behind the guidelines is to 1) access NVM hardware in a packed manner to minimize its bandwidth utilization and 2) exploit asynchronous concurrency control to decouple the long NVM latency from the critical path of the index. We develop PACTree, a high-performance persistent range index following the PAC guidelines. PACTree is a hybrid index that employs a trie index for its internal nodes and B+-tree-like leaf nodes. The trie index structure packs partial keys in internal nodes. Moreover, we decouple the trie index and B+-tree-like leaf nodes. The decoupling allows us to prevent blocking concurrent accesses by updating internal nodes asynchronously. Our evaluation shows that PACTree outperforms state-of-the-art persistent range indexes by 7x in performance and 20x in 99.99 percentile tail latency.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80625897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 34

LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism LineFS:高效的SmartNIC卸载分布式文件系统与流水线并行

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483565

Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, M. Canini, Dejan Kostic, Youngjin Kwon, Simon Peter, E. Witchel

In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a Smart-NIC. We implement LineFS on the Mellanox BlueField Smart-NIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.

在多租户系统中，分布式文件系统(dfs)的CPU开销日益成为应用程序性能的负担。CPU和内存干扰会导致应用程序和存储性能下降和不稳定，特别是操作延迟。最近用于持久内存(PM)的客户端本地dfse加速了这一趋势。DFS卸载到SmartNIC是解决这些问题的一个很有前途的解决方案，但是将DFS的复杂需求适应到位于PCIe上的简单SmartNIC处理器上是一个挑战。我们介绍了LineFS，这是一个卸载smartnic的高性能DFS，支持客户端本地PM。为了充分利用SmartNIC架构，我们将DFS操作分解为执行阶段，这些执行阶段可以卸载到SmartNIC上的并行数据路径执行管道上。LineFS将cpu密集型DFS任务(如复制、压缩、数据发布、索引和一致性管理)卸载到Smart-NIC。我们在Mellanox BlueField智能网卡上实现LineFS，并将其与Assise进行比较，Assise是最先进的PM DFS。LineFS将LevelDB中的延迟提高到80%，将Filebench中的吞吐量提高到79%，同时在主机系统故障期间提供扩展的DFS可用性。

{"title":"LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism","authors":"Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, M. Canini, Dejan Kostic, Youngjin Kwon, Simon Peter, E. Witchel","doi":"10.1145/3477132.3483565","DOIUrl":"https://doi.org/10.1145/3477132.3483565","url":null,"abstract":"In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly a burden to application performance. CPU and memory interference cause degraded and unstable application and storage performance, in particular for operation latency. Recent client-local DFSes for persistent memory (PM) accelerate this trend. DFS offload to SmartNICs is a promising solution to these problems, but it is challenging to fit the complex demands of a DFS onto simple SmartNIC processors located across PCIe. We present LineFS, a SmartNIC-offloaded, high-performance DFS with support for client-local PM. To fully leverage the SmartNIC architecture, we decompose DFS operations into execution stages that can be offloaded to a parallel datapath execution pipeline on the SmartNIC. LineFS offloads CPU-intensive DFS tasks, like replication, compression, data publication, index and consistency management to a Smart-NIC. We implement LineFS on the Mellanox BlueField Smart-NIC and compare it to Assise, a state-of-the-art PM DFS. LineFS improves latency in LevelDB up to 80% and throughput in Filebench up to 79%, while providing extended DFS availability during host system failures.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87765290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 47

CLoF CLoF

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483557

Rafael Lourenco de Lima Chehab, Antonio Paolillo, Diogo Behrens, M. Fu, Hermann Härtig, Haibo Chen

Efficient locking mechanisms are extremely important to support large-scale concurrency and exploit the performance promises of many-core servers. Implementing an efficient, generic, and correct lock is very challenging due to the differences between various NUMA architectures. The performance impact of architectural/NUMA hierarchy differences between x86 and Armv8 are not yet fully explored, leading to unexpected performance when simply porting NUMA-aware locks from x86 to Armv8. Moreover, due to the Armv8 Weak Memory Model (WMM), correctly implementing complicated NUMA-aware locks is very difficult. We propose a Compositional Lock Framework (CLoF) for multi-level NUMA systems. CLoF composes NUMA-oblivious locks in a hierarchy matching the target platform, leading to hundreds of correct by construction NUMA-aware locks. CLoF can automatically select the best lock among them. To show the correctness of CLoF on WMMs, we provide an inductive argument with base and induction steps verified with model checkers. In our evaluation, CLoF locks outperform state-of-the-art NUMA-aware locks in most scenarios, e.g., in a highly contended LevelDB benchmark, our best CLoF locks yield twice the throughput achieved with CNA lock and ShflLock on large x86 and Armv8 servers.

{"title":"CLoF","authors":"Rafael Lourenco de Lima Chehab, Antonio Paolillo, Diogo Behrens, M. Fu, Hermann Härtig, Haibo Chen","doi":"10.1145/3477132.3483557","DOIUrl":"https://doi.org/10.1145/3477132.3483557","url":null,"abstract":"Efficient locking mechanisms are extremely important to support large-scale concurrency and exploit the performance promises of many-core servers. Implementing an efficient, generic, and correct lock is very challenging due to the differences between various NUMA architectures. The performance impact of architectural/NUMA hierarchy differences between x86 and Armv8 are not yet fully explored, leading to unexpected performance when simply porting NUMA-aware locks from x86 to Armv8. Moreover, due to the Armv8 Weak Memory Model (WMM), correctly implementing complicated NUMA-aware locks is very difficult. We propose a Compositional Lock Framework (CLoF) for multi-level NUMA systems. CLoF composes NUMA-oblivious locks in a hierarchy matching the target platform, leading to hundreds of correct by construction NUMA-aware locks. CLoF can automatically select the best lock among them. To show the correctness of CLoF on WMMs, we provide an inductive argument with base and induction steps verified with model checkers. In our evaluation, CLoF locks outperform state-of-the-art NUMA-aware locks in most scenarios, e.g., in a highly contended LevelDB benchmark, our best CLoF locks yield twice the throughput achieved with CNA lock and ShflLock on large x86 and Armv8 servers.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82326025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP 利用POP高效解决大规模粒度资源分配问题

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-22 DOI: 10.1145/3477132.3483588

D. Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, M. Zaharia

Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.

许多计算机系统中的资源分配问题可以表述为数学优化问题。然而，对于具有严格sla的大型问题，使用现成的求解器找到这些问题的精确解决方案通常是难以处理的，这导致系统设计人员依赖于廉价的启发式算法。然而，我们观察到，许多分配问题是细粒度的:它们由大量的客户机和资源组成，每个客户机请求的资源只占资源总数的一小部分，并且客户机可以互换地使用不同的资源。对于这些问题，我们提出了一种替代方法，该方法重用原始的优化问题公式，并比特定领域的启发式方法更好地进行分配。我们的技术，分区优化问题(POP)，将问题随机分割为更小的问题(系统中有客户端和资源的子集)，并将结果子分配合并为所有客户端的全局分配。我们提供理论和经验的证据，为什么随机划分工作良好。在我们的实验中，与现有的集群调度、流量工程和负载平衡系统相比，POP在运行时实现了1.5%的最优分配，并在运行时进行了数量级的改进。

{"title":"Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP","authors":"D. Narayanan, Fiodar Kazhamiaka, Firas Abuzaid, Peter Kraft, Akshay Agrawal, Srikanth Kandula, Stephen P. Boyd, M. Zaharia","doi":"10.1145/3477132.3483588","DOIUrl":"https://doi.org/10.1145/3477132.3483588","url":null,"abstract":"Resource allocation problems in many computer systems can be formulated as mathematical optimization problems. However, finding exact solutions to these problems using off-the-shelf solvers is often intractable for large problem sizes with tight SLAs, leading system designers to rely on cheap, heuristic algorithms. We observe, however, that many allocation problems are granular: they consist of a large number of clients and resources, each client requests a small fraction of the total number of resources, and clients can interchangeably use different resources. For these problems, we propose an alternative approach that reuses the original optimization problem formulation and leads to better allocations than domain-specific heuristics. Our technique, Partitioned Optimization Problems (POP), randomly splits the problem into smaller problems (with a subset of the clients and resources in the system) and coalesces the resulting sub-allocations into a global allocation for all clients. We provide theoretical and empirical evidence as to why random partitioning works well. In our experiments, POP achieves allocations within 1.5% of the optimal with orders-of-magnitude improvements in runtime compared to existing systems for cluster scheduling, traffic engineering, and load balancing.","PeriodicalId":38935,"journal":{"name":"Operating Systems Review (ACM)","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2021-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88552522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 27

Rabia: Simplifying State-Machine Replication Through Randomization Rabia:通过随机化简化状态机复制

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-09-26 DOI: 10.1145/3477132.3483582

Haochen Pan, Jesse Tuglu, Neo Zhou, Tianshu Wang, Yicheng Shen, Xiong Zheng, Joseph Tassarotti, Lewis Tseng, R. Palmieri

We introduce Rabia, a simple and high performance framework for implementing state-machine replication (SMR) within a datacenter. The main innovation of Rabia is in using randomization to simplify the design. Rabia provides the following two features: (i) It does not need any fail-over protocol and supports trivial auxiliary protocols like log compaction, snapshotting, and reconfiguration, components that are often considered the most challenging when developing SMR systems; and (ii) It provides high performance, up to 1.5x higher throughput than the closest competitor (i.e., EPaxos) in a favorable setup (same availability zone with three replicas) and is comparable with a larger number of replicas or when deployed in multiple availability zones.

我们介绍Rabia，这是一个简单的高性能框架，用于在数据中心内实现状态机复制(SMR)。Rabia的主要创新是使用随机化来简化设计。Rabia提供了以下两个特性:(i)它不需要任何故障转移协议，并支持琐碎的辅助协议，如日志压缩、快照和重新配置，这些组件通常被认为是开发SMR系统时最具挑战性的;(ii)它提供高性能，在有利的设置(具有三个副本的相同可用区)中，吞吐量比最接近的竞争对手(即EPaxos)高1.5倍，并且与大量副本或部署在多个可用区时相当。

引用次数: 12

Basil: Breaking up BFT with ACID (transactions) Basil:用ACID(交易)分解BFT

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-09-25 DOI: 10.1145/3477132.3483552

Florian Suri-Payer, Matthew Burke, Zheng Wang, Yunhao Zhang, Lorenzo Alvisi, Natacha Crooks

This paper presents Basil, the first transactional, leaderless Byzantine Fault Tolerant key-value store. Basil leverages ACID transactions to scalably implement the abstraction of a trusted shared log in the presence of Byzantine actors. Unlike traditional BFT approaches, Basil executes non-conflicting operations in parallel and commits transactions in a single round-trip during fault-free executions. Basil improves throughput over traditional BFT systems by four to five times, and is only four times slower than TAPIR, a non-Byzantine replicated system. Basil's novel recovery mechanism further minimizes the impact of failures: with 30% Byzantine clients, throughput drops by less than 25% in the worst-case.

本文介绍了Basil，第一个事务性的、无领导的拜占庭式容错键值存储。Basil利用ACID事务在拜占庭参与者存在的情况下可伸缩地实现可信共享日志的抽象。与传统的BFT方法不同，Basil并行执行无冲突的操作，并在无故障执行期间在单次往返中提交事务。Basil将传统BFT系统的吞吐量提高了4到5倍，仅比TAPIR慢4倍，TAPIR是一种非拜占庭式复制系统。Basil的新型恢复机制进一步减少了故障的影响:在30%的拜占庭客户端情况下，吞吐量在最坏情况下下降不到25%。

引用次数: 55

Regular Sequential Serializability and Regular Sequential Consistency 常规顺序可序列化性和常规顺序一致性

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-09-18 DOI: 10.1145/3477132.3483566

Jeffrey Helt, Matthew Burke, A. Levy, Wyatt Lloyd

Strictly serializable (linearizable) services appear to execute transactions (operations) sequentially, in an order consistent with real time. This restricts a transaction's (operation's) possible return values and in turn, simplifies application programming. In exchange, strictly serializable (linearizable) services perform worse than those with weaker consistency. But switching to such services can break applications. This work introduces two new consistency models to ease this trade-off: regular sequential serializability (RSS) and regular sequential consistency (RSC). They are just as strong for applications: we prove any application invariant that holds when using a strictly serializable (linearizable) service also holds when using an RSS (RSC) service. Yet they relax the constraints on services---they allow new, better-performing designs. To demonstrate this, we design, implement, and evaluate variants of two systems, Spanner and Gryff, relaxing their consistency to RSS and RSC, respectively. The new variants achieve better read-only transaction and read tail latency than their counterparts.

严格序列化(线性化)的服务似乎按顺序执行事务(操作)，其顺序与实时一致。这限制了事务(操作)可能的返回值，从而简化了应用程序编程。作为交换，严格序列化(线性化)的服务比一致性较弱的服务性能更差。但是切换到这样的服务可能会破坏应用程序。这项工作引入了两个新的一致性模型来缓解这种权衡:常规顺序序列化性(RSS)和常规顺序一致性(RSC)。它们对于应用程序同样强大:我们证明了当使用严格序列化(线性化)服务时成立的任何应用程序不变量在使用RSS (RSC)服务时也成立。然而，它们放宽了对服务的限制——它们允许新的、性能更好的设计。为了证明这一点，我们设计、实现和评估了两个系统Spanner和Gryff的变体，分别放宽了它们与RSS和RSC的一致性。新的变体实现了更好的只读事务和读尾延迟。

引用次数: 4

首页上一页

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Operating Systems Review (ACM)

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀