Operating Systems Review (ACM): Latest Articles, Page 5

HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483550

Amanda Raybuck, Tim Stamler, Wei Zhang, M. Erez, Simon Peter

High-capacity non-volatile memory (NVM) is a new main memory tier. Tiered DRAM+NVM servers increase total memory capacity by up to 8x, but can diminish memory bandwidth by up to 7x and inflate latency by up to 63% if not managed well. We study existing hardware and software tiered memory management systems on the recently available Intel Optane DC NVM with big data applications and find that no existing system maximizes application performance on real NVM. Based on our findings, we present HeMem, a tiered main memory management system designed from scratch for commercially available NVM and the big data applications that use it. HeMem manages tiered memory asynchronously, batching and amortizing memory access tracking, migration, and associated TLB synchronization overheads. HeMem monitors application memory use by sampling memory access via CPU events, rather than page tables. This allows HeMem to scale to terabytes of memory, keeping small and ephemeral data structures in fast memory, and allocating scarce, asymmetric NVM bandwidth according to access patterns. Finally, HeMem is flexible by placing per-application memory management policy at user-level. On a system with Intel Optane DC NVM, HeMem outperforms hardware, OS, and PL-based tiered memory management, providing up to 50% runtime reduction for the GAP graph processing benchmark, 13% higher throughput for TPC-C on the Silo in-memory database, 16% lower tail-latency under performance isolation for a key-value store, and up to 10x less NVM wear than the next best solution, without application modification.
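
The idea of sampling accesses and migrating in batches can be illustrated with a toy model. This is only a sketch under stated assumptions, not HeMem's implementation: the class name `SampledTierManager`, the threshold and sampling-period parameters, and the counter-halving decay are all hypothetical choices made for illustration, and real sampling would come from CPU events rather than a Python callback.

```python
from collections import Counter


class SampledTierManager:
    """Toy model of sampling-based hot-page tracking with batched migration."""

    def __init__(self, dram_pages, hot_threshold=2, sample_period=4):
        self.dram_pages = dram_pages        # DRAM capacity in pages (hypothetical)
        self.hot_threshold = hot_threshold  # samples needed to call a page hot
        self.sample_period = sample_period  # record only every Nth access
        self.samples = Counter()            # page -> sampled access count
        self.in_dram = set()                # pages currently placed in DRAM
        self._seen = 0

    def on_access(self, page):
        """Called on every access; only every sample_period-th one is recorded."""
        self._seen += 1
        if self._seen % self.sample_period == 0:
            self.samples[page] += 1

    def migrate_batch(self):
        """Run periodically, off the access path; returns (promoted, demoted)."""
        hot = [p for p, c in self.samples.most_common() if c >= self.hot_threshold]
        target = set(hot[: self.dram_pages])
        # Keep already-resident pages while DRAM still has spare room.
        for page in sorted(self.in_dram):
            if len(target) >= self.dram_pages:
                break
            target.add(page)
        promoted = target - self.in_dram
        demoted = self.in_dram - target
        self.in_dram = target
        # Halve the counters so that stale hotness decays over time.
        self.samples = Counter(
            {p: c // 2 for p, c in self.samples.items() if c // 2}
        )
        return promoted, demoted
```

Because `on_access` does constant work and all placement decisions happen in `migrate_batch`, the cost of tracking and migration is paid once per batch rather than once per access, which is the amortization the abstract describes.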

Citations: 48

Xenic: SmartNIC-Accelerated Distributed Transactions

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483555

Henry N. Schuh, Weihao Liang, Ming G. Liu, J. Nelson, A. Krishnamurthy

High-performance distributed transactions require efficient remote operations on database memory and protocol metadata. The high communication cost of this workload calls for hardware acceleration. Recent research has applied RDMA to this end, leveraging the network controller to manipulate host memory without consuming CPU cycles on the target server. However, the basic read/write RDMA primitives demand trade-offs in data structure and protocol design, limiting their benefits. SmartNICs are a flexible alternative for fast distributed transactions, adding programmable compute cores and on-board memory to the network interface. Applying measured performance characteristics, we design Xenic, a SmartNIC-optimized transaction processing system. Xenic applies an asynchronous, aggregated execution model to maximize network and core efficiency. Xenic's co-designed data store achieves low-overhead remote object accesses. Additionally, Xenic uses flexible, point-to-point communication patterns between SmartNICs to minimize transaction commit latency. We compare Xenic against prior RDMA- and RPC-based transaction systems with the TPC-C, Retwis, and Smallbank benchmarks. Our results for the three benchmarks show 2.42x, 2.07x, and 2.21x throughput improvement, 59%, 42%, and 22% latency reduction, while saving 2.3, 8.1, and 10.1 threads per server.

Citations: 31

Kauri

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483584

Ray Neiheiser, M. Matos, Luís Rodrigues

With the growing commercial interest in blockchains, permissioned implementations have received increasing attention. Unfortunately, the BFT consensus algorithms that are the backbone of most of these blockchains scale poorly and offer limited throughput. Many state-of-the-art algorithms require a single leader process to receive and validate votes from a quorum of processes and then broadcast the result, which is inherently non-scalable. Recent approaches avoid this bottleneck by using dissemination/aggregation trees to propagate values and collect and validate votes. However, the use of trees increases the round latency, which ultimately limits the throughput for deeper trees. In this paper we propose Kauri, a BFT communication abstraction that can sustain high throughput as the system size grows, leveraging a novel pipelining technique to perform scalable dissemination and aggregation on trees. Our evaluation shows that Kauri outperforms the throughput of state-of-the-art permissioned blockchain protocols, such as HotStuff, by up to 28x. Interestingly, in many scenarios, the parallelization provided by Kauri can also decrease the latency.
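
Vote aggregation over a tree, as opposed to a single leader collecting every vote, can be sketched as follows. This is a simplified illustration, not Kauri's protocol: the array-style tree layout, the function names, and the quorum rule `n - f` with `f = (n - 1) // 3` are assumptions, and the sketch omits signatures, faulty internal nodes, and the pipelining that the paper contributes.

```python
def children(i, n, fanout):
    """Indices of node i's children in an array-laid-out tree of n processes."""
    first = i * fanout + 1
    return range(first, min(first + fanout, n))


def aggregate(i, n, fanout, votes):
    """Return the set of processes in the subtree rooted at i that voted yes."""
    collected = {i} if votes[i] else set()
    for child in children(i, n, fanout):
        collected |= aggregate(child, n, fanout, votes)
    return collected


def quorum_size(n):
    """Quorum for n processes tolerating f = (n - 1) // 3 Byzantine faults."""
    f = (n - 1) // 3
    return n - f  # equals 2f + 1 when n = 3f + 1


if __name__ == "__main__":
    n, fanout = 13, 3
    votes = [True] * n
    votes[5] = votes[12] = False  # two leaves do not vote
    voters = aggregate(0, n, fanout, votes)
    print(len(voters) >= quorum_size(n))
```

In this layout the root receives messages from only `fanout` children instead of `n - 1` processes, which removes the leader bottleneck, while each extra tree level adds a hop of latency per round.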

Citations: 34

RAS: Continuously Optimized Region-Wide Datacenter Resource Allocation

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483578

Andrew Newell, Dimitrios Skarlatos, Jingyuan Fan, Pavan Kumar, Maxim Khutornenko, Mayank Pundir, Yirui Zhang, Mingjun Zhang, Yuanlai Liu, Linh Le, Brendon Daugherty, Apurva Samudra, Prashasti Baid, James Kneeland, Igor Kabiljo, Dmitry Shchukin, André Rodrigues, S. Michelson, B. Christensen, K. Veeraraghavan, Chunqiang Tang

Capacity reservation is a common offering in public clouds and on-premise infrastructure. However, no prior work provides capacity reservation with SLO guarantees that takes into account random and correlated hardware failures, datacenter maintenance, and heterogeneous hardware. In this paper, we describe how Facebook's region-scale Resource Allowance System (RAS) addresses these issues and provides guaranteed capacity. RAS uses a capacity abstraction called reservation to represent a set of servers dynamically assigned to a logical cluster. We take a two-level approach to scale resource allocation to all datacenters in a region, where a mixed-integer-programming solver continuously optimizes server-to-reservation assignments off the critical path, and a traditional container allocator does real-time placement of containers on servers in a reservation. As a relatively new component of Facebook's 10-year old cluster manager Twine, RAS has been running in production for almost two years, continuously optimizing the allocation of millions of servers to thousands of reservations. We describe the design of RAS and share our experience of deploying it at scale.
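
The two-level split described above can be sketched in a few lines. This is an illustration under loud assumptions: a simple greedy loop stands in for the mixed-integer-programming solver, and the function names, the per-server `capacity` limit, and the least-loaded placement rule are hypothetical rather than taken from RAS or Twine.

```python
def assign_servers(servers, demand):
    """Level 1 (slow path): map servers to reservations.

    A greedy stand-in for the MIP solver; it ignores failures,
    maintenance, and hardware heterogeneity.
    """
    free = list(servers)
    assignment = {}
    for reservation, count in demand.items():
        if count > len(free):
            raise ValueError(f"not enough servers for {reservation}")
        assignment[reservation] = [free.pop(0) for _ in range(count)]
    return assignment


def place_container(assignment, load, reservation, capacity=2):
    """Level 2 (fast path): pick the least-loaded server in a reservation."""
    candidates = [
        s for s in assignment[reservation] if load.get(s, 0) < capacity
    ]
    if not candidates:
        return None  # the reservation is full
    server = min(candidates, key=lambda s: load.get(s, 0))
    load[server] = load.get(server, 0) + 1
    return server
```

The point of the split is that `place_container` only ever searches the servers of one reservation, so it stays fast, while the expensive global optimization in `assign_servers` can be rerun periodically without blocking container starts.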

Citations: 10

Bladerunner: Stream Processing at Scale for a Live View of Backend Data Mutations at the Edge

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483572

Jeffrey A. Barber, Ximing Yu, Laney Kuenzel Zamore, Jerry Lin, Vahid Jazayeri, Shie S. Erlich, T. Savor, M. Stumm

Consider a social media platform with hundreds of millions of online users at any time, utilizing a social graph that has many billions of nodes and edges. The problem this paper addresses is how to provide each user a continuously fresh, up-to-date view of the parts of the social graph they are currently interested in, so as to provide a positive interactive user experience. The problem is challenging because the social graph mutates at a high rate, users change their focus of interest frequently, and some mutations are of interest to many online users. We describe Bladerunner, a system we use at Facebook to deliver relevant social graph updates to user devices efficiently and quickly. The heart of Bladerunner is a set of back-end stream processors that obtain streams of social graph updates and process them on a per application and per-user basis before pushing selected updates to user devices. Separate stream processors are used for each application to enable application-specific customization, complex filtering, aggregation and other message delivery operations on a per-user basis. This strategy minimizes device processing overhead and last-mile bandwidth usage, which are critical given that users are mostly on mobile devices.
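
Server-side, per-user filtering of an update stream might look like the sketch below. It is a minimal illustration, not Bladerunner's design: the `StreamProcessor` class, the topic strings, and the predicate-based filter are hypothetical names and choices introduced here.

```python
from collections import defaultdict


class StreamProcessor:
    """Toy per-application processor that filters updates for each user."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # topic -> user ids
        self.filters = {}                    # (user, topic) -> predicate

    def subscribe(self, user, topic, predicate=lambda update: True):
        """Register interest in a topic, with an optional per-user filter."""
        self.subscribers[topic].add(user)
        self.filters[(user, topic)] = predicate

    def on_update(self, topic, update):
        """Return {user: update} for the users whose filter accepts it."""
        return {
            user: update
            for user in sorted(self.subscribers.get(topic, ()))
            if self.filters[(user, topic)](update)
        }
```

Filtering at the back end means an update a user does not care about is never sent, which is how such a design saves device processing and last-mile bandwidth.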

Citations: 2

Bidl: A High-throughput, Low-latency Permissioned Blockchain Framework for Datacenter Networks

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483574

Ji Qi, Xusheng Chen, Yunpeng Jiang, Jianyu Jiang, Tianxiang Shen, Shixiong Zhao, Sen Wang, Gong Zhang, Li Chen, M. Au, Heming Cui

A permissioned blockchain framework typically runs an efficient Byzantine consensus protocol and is attractive to deploy fast trading applications among a large number of mutually untrusted participants (e.g., companies). Unfortunately, all existing permissioned blockchain frameworks adopt sequential workflows for invoking the consensus protocol and executing applications' transactions, making the performance of these applications much lower than deploying them in traditional systems (e.g., in-datacenter stock exchange). We propose Bidl, the first permissioned blockchain framework highly optimized for datacenter networks. We leverage the network ordering in such networks to create a shepherded parallel workflow, which carries a sequencer to parallelize the consensus protocol and transaction execution speculatively. However, the presence of malicious participants (e.g., a malicious sequencer) can easily perturb the parallel workflow to greatly degrade Bidl's performance. To achieve stable high performance, Bidl efficiently shepherds all participants by detecting their misbehaviors, and performs denylist-based view changes to replace or deny malicious participants. Compared with three fast permissioned blockchain frameworks, Bidl's parallel workflow reduces applications' latency by up to 72.7% and improves their throughput by up to 4.3x in the presence of malicious participants. Bidl is suitable to be integrated with traditional stock exchange systems. Bidl's code is released on github.com/hku-systems/bidl.

Citations: 23

MIND

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483561

To begin with the psychologists have not yet made it clear what Mind is. I do not mean its substratum; but they have not even made it clear what a psychical phenomenon is. Far less has any notion of mind been established and generally acknowledged which can compare for an instant in distinctness to the dynamical conception of matter. Almost all the psychologists still tell us that mind is consciousness. But to my apprehension Hartmann has proved conclusively that unconscious mind exists. What is meant by consciousness is really in itself nothing but feeling. Gay and Hartley were quite right about that; and though there may be, and probably is, something of the general nature of feeling almost everywhere, yet feeling in any ascertainable degree is a mere property of protoplasm, perhaps only of nerve matter. Now it so happens that biological organisms, and especially a nervous system are favorably conditioned for exhibiting the phenomena of mind also; and therefore it is not surprising that mind and feeling should be confounded. But I do not believe that psychology can be set to rights until the importance of Hartmann’s argument is acknowledged, and it is seen that feeling is nothing but the inward aspect of things, while mind on the contrary is essentially an external phenomenon. The error is very much like that which was so long prevalent that an electrical current moved through the metallic wire; while it is now known that that is just the only place from which it is cut off, being wholly external to the wire. Again, the psychologists undertake to locate various mental powers in the brain; and above all consider it as quite certain that the faculty of language resides in a certain lobe; but I believe it comes decidedly nearer the truth (though not really true) that language resides in the tongue. In my opinion it is much more true that the thoughts of a living writer are in any printed copy of his book than that they are in his brain.

Citations: 48

Formal Verification of a Multiprocessor Hypervisor on Arm Relaxed Memory Hardware

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483560

Runzhou Tao, Jianan Yao, Xupeng Li, Shih-wei Li, Jason Nieh, Ronghui Gu

Concurrent systems software is widely-used, complex, and error-prone, posing a significant security risk. We introduce VRM, a new framework that makes it possible for the first time to verify concurrent systems software, such as operating systems and hypervisors, on Arm relaxed memory hardware. VRM defines a set of synchronization and memory access conditions such that a program that satisfies these conditions can be mostly verified on a sequentially consistent hardware model and the proofs will automatically hold on relaxed memory hardware. VRM can be used to verify concurrent kernel code that is not data race free, including code responsible for managing shared page tables in the presence of relaxed MMU hardware. Using VRM, we verify the security guarantees of a retrofitted implementation of the Linux KVM hypervisor on Arm. For multiple versions of KVM, we prove KVM's security properties on a sequentially consistent model, then prove that KVM satisfies VRM's required program conditions such that its security proofs hold on Arm relaxed memory hardware. Our experimental results show that the retrofit and VRM conditions do not adversely affect the scalability of verified KVM, as it performs similar to unmodified KVM when concurrently running many multiprocessor virtual machines with real application workloads on Arm multiprocessor server hardware. Our work is the first machine-checked proof for concurrent systems software on Arm relaxed memory hardware.

Citations: 19

PRISM: Rethinking the RDMA Interface for Distributed Systems

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483587

Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, J. Nelson, Irene Zhang, Dan R. K. Ports

Remote Direct Memory Access (RDMA) has been used to accelerate a variety of distributed systems by providing low-latency, CPU-bypassing access to a remote host's memory. However, most of the distributed protocols used in these systems cannot easily be expressed in terms of the simple memory READs and WRITEs provided by RDMA. As a result, designers face a choice between introducing additional protocol complexity (e.g., additional round trips) and forgoing the benefits of RDMA entirely. This paper argues that an extension to the RDMA interface can resolve this dilemma. We introduce the PRISM interface, which adds four new primitives: indirection, allocation, enhanced compare-and-swap, and operation chaining. These increase the expressivity of the RDMA interface while still being implementable using the same underlying hardware features. We show their utility by designing three new applications that use PRISM primitives and require little to no server-side CPU involvement: (1) PRISM-KV, a key-value store; (2) PRISM-RS, a replicated block store; and (3) PRISM-TX, a distributed transaction protocol. Using a software-based implementation of the PRISM primitives, we show that these systems outperform prior RDMA-based equivalents.
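
To make the four primitive names more concrete, here is a minimal in-process sketch in Python. It is only one plausible reading of the names in the abstract, not the PRISM API or the paper's definitions. `ToyRemoteMemory`, its methods, and `put` are all hypothetical. The abstract does not say what makes the compare-and-swap "enhanced", so a plain compare-and-swap is modeled instead.

```python
class ToyRemoteMemory:
    """Hypothetical word-addressed memory standing in for a remote host."""

    def __init__(self, size, free_slots):
        self.cells = [0] * size        # the "remote" memory
        self.free = list(free_slots)   # addresses assumed to be pre-registered as free

    def write(self, addr, value):
        # Plain one-sided WRITE.
        self.cells[addr] = value
        return True

    def read_indirect(self, ptr_addr):
        # Indirection: follow the pointer stored at ptr_addr in a single operation.
        return self.cells[self.cells[ptr_addr]]

    def allocate(self, value):
        # Allocation: take a free slot, fill it, and return its address.
        addr = self.free.pop(0)
        self.cells[addr] = value
        return addr

    def compare_and_swap(self, addr, expected, new):
        # Plain compare-and-swap: install `new` only if the cell still holds `expected`.
        if self.cells[addr] == expected:
            self.cells[addr] = new
            return True
        return False

    def chain(self, *ops):
        # Chaining: run operations in order, passing each result to the next one.
        result = None
        for op in ops:
            result = op(self, result)
        return result


def put(mem, ptr_addr, old_ptr, value):
    """Out-of-place update: allocate the new value, then swing the pointer."""
    return mem.chain(
        lambda m, _: m.allocate(value),
        lambda m, new_ptr: m.compare_and_swap(ptr_addr, old_ptr, new_ptr),
    )
```

In this toy model, a client that knows the current pointer can update a value with a single chained request, and a stale client's update fails at the compare-and-swap step without any server-side logic. The slot allocated by a failed `put` is simply leaked here; a real design would need to reclaim it.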



Citations: 20

Coeus: A System for Oblivious Document Ranking and Retrieval

Q3 Computer Science

Operating Systems Review (ACM)

Pub Date : 2021-10-26 DOI: 10.1145/3477132.3483586

Ishtiyaque Ahmad, Laboni Sarker, D. Agrawal, A. E. Abbadi, Trinabh Gupta

Given a private string q and a remote server that holds a set of public documents D, how can one of the K most relevant documents to q in D be selected and viewed without anyone (not even the server) learning anything about q or the document? This is the oblivious document ranking and retrieval problem. In this paper, we describe Coeus, a system that solves this problem. At a high level, Coeus composes two cryptographic primitives: secure matrix-vector product for scoring document relevance using the widely used term frequency-inverse document frequency (tf-idf) method, and private information retrieval (PIR) for obliviously retrieving documents. However, Coeus reduces the time to run these protocols, thereby improving the user-perceived latency, which is a key performance metric. Coeus first reduces the PIR overhead by separating out private metadata retrieval from document retrieval, and it then scales secure matrix-vector product to tf-idf matrices with several hundred billion elements through a series of novel cryptographic refinements. With a corpus of English Wikipedia containing 5 million documents, a keyword dictionary with 64K keywords, and a cluster of 143 machines on AWS, Coeus enables a user to obliviously rank and retrieve a document in 3.9 seconds, a 24x improvement over a baseline system.
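
The scoring step described above can be pictured as an ordinary matrix-vector product. The Python sketch below shows only the plaintext computation. Coeus performs the product under encryption, which is not modeled here. The function names, the whitespace tokenizer, and the specific tf-idf weighting (term frequency times `log(N / df)`) are assumptions made for illustration.

```python
import math


def tfidf_matrix(docs, vocab):
    """Build a documents-by-terms tf-idf matrix (assumes every document is non-empty)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    rows = []
    for toks in tokenized:
        row = []
        for t in vocab:
            tf = toks.count(t) / len(toks)
            idf = math.log(n / df[t]) if df[t] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows


def query_vector(query, vocab):
    """Binary indicator vector: 1 where the query contains the vocabulary term."""
    terms = set(query.lower().split())
    return [1 if t in terms else 0 for t in vocab]


def score(matrix, qvec):
    """Matrix-vector product: one relevance score per document."""
    return [sum(w * q for w, q in zip(row, qvec)) for row in matrix]


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

For example, with the three toy documents `"rdma memory access"`, `"memory tiering nvm memory"`, and `"document ranking retrieval"`, the vocabulary `["memory", "nvm", "ranking"]`, and the query `"nvm memory"`, the second document scores highest and the third scores zero.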



Citations: 7