Optimizing the Block I/O Subsystem for Fast Storage Devices
Youngjin Yu, Dongin Shin, Woong Shin, N. Song, Jae-Woo Choi, H. Kim, Hyeonsang Eom, H. Yeom
Fast storage devices are an emerging solution for satisfying the demands of data-intensive applications. They provide high transaction rates for DBMSs, low response times for Web servers, instant on-demand paging for applications with large memory footprints, and many similar advantages for performance-hungry applications. In spite of the benefits promised by fast hardware, modern operating systems are not yet structured to take advantage of the hardware’s full potential. The software overhead caused by an OS, negligible in the past, now adversely impacts application performance, lessening the advantage of using such hardware. Our analysis demonstrates that the overheads of the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system. In this article, we propose six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices. With the support of new hardware interfaces, our optimizations minimize per-request latency by streamlining the I/O path and amortize per-request latency by maximizing parallelism inside the device. We demonstrate the impact on application performance through well-known storage benchmarks run against a Linux kernel with a customized SSD. We find that eliminating context switches in the I/O path decreases the software overhead of an I/O request from 20 microseconds to 5 microseconds, and that a new request merge scheme called Temporal Merge enables the OS to achieve 87% to 100% of peak device performance, regardless of request access patterns or types. Although the performance improvement from these optimizations on a standard SATA-based SSD is marginal (because of its limited interface and relatively high response times), our sensitivity analysis suggests that future SSDs with lower response times will benefit from these changes. The effectiveness of our optimizations encourages discussion between the OS community and storage vendors about future device interfaces for fast storage devices.
ACM Transactions on Computer Systems (TOCS). https://doi.org/10.1145/2619092
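The abstract names Temporal Merge but not its mechanics. As a rough sketch of the idea it describes — amortizing per-request cost by coalescing requests that arrive close together in time, independent of their addresses — here is a minimal, hypothetical illustration; the Request type, the batching window, and the vectored submission are assumptions, not the paper's interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    """A block I/O request (hypothetical simplified form)."""
    lba: int        # logical block address
    nblocks: int    # transfer length in blocks
    is_write: bool

def temporal_merge(queue: List[Request], window: int = 32) -> List[List[Request]]:
    """Group up to `window` queued requests into one vectored command,
    regardless of whether their addresses are contiguous.

    Traditional merging coalesces only spatially adjacent requests; the
    point of a temporal scheme is that requests arriving close together
    in *time* can share one submission, letting the device exploit its
    internal parallelism."""
    return [queue[i:i + window] for i in range(0, len(queue), window)]

# Example: four scattered requests become a single vectored submission.
pending = [Request(8, 8, False), Request(4096, 8, False),
           Request(72, 16, True), Request(9000, 8, False)]
for batch in temporal_merge(pending):
    print(f"submit one command carrying {len(batch)} requests")
```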
Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators
Yunsup Lee, Rimas Avizienis, Alex Bishara, R. Xia, Derek Lockhart, C. Batten, K. Asanović
We present a taxonomy and modular implementation approach for data-parallel accelerators, including the MIMD, vector-SIMD, subword-SIMD, SIMT, and vector-thread (VT) architectural design patterns. We introduce Maven, a new VT microarchitecture based on the traditional vector-SIMD microarchitecture that is considerably simpler to implement and easier to program than previous VT designs. Using an extensive design-space exploration of full VLSI implementations of many accelerator design points, we evaluate the tradeoffs between programmability and implementation efficiency among the MIMD, vector-SIMD, and VT patterns on a workload of compiled microbenchmarks and application kernels. We find that the vector cores provide greater efficiency than the MIMD cores, even on fairly irregular kernels. Our results suggest that the Maven VT microarchitecture is superior to the traditional vector-SIMD architecture, providing both greater efficiency and easier programmability.
ACM Transactions on Computer Systems (TOCS). https://doi.org/10.1145/2491464
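The claim that vector cores stay efficient "even on fairly irregular kernels" hinges on executing divergent control flow under predication rather than as independent scalar threads. The NumPy sketch below only illustrates that masking idea in software; it is not Maven's ISA or the paper's benchmark code.

```python
import numpy as np

x = np.array([3.0, -1.0, 4.0, -5.0, 2.0])

# Scalar/MIMD style: each element takes its own control path.
scalar_out = np.array([xi * 2.0 if xi > 0 else -xi for xi in x])

# Vector/predicated style: both paths are computed over the whole
# vector, then combined under a mask -- one wide instruction stream,
# no per-element control flow.
mask = x > 0
vector_out = np.where(mask, x * 2.0, -x)

assert np.allclose(scalar_out, vector_out)
```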
Protocol Responsibility Offloading to Improve TCP Throughput in Virtualized Environments
S. Gamage, R. Kompella, Dongyan Xu, Ardalan Kangarlou
Virtualization is a key technology that powers cloud computing platforms such as Amazon EC2. Virtual machine (VM) consolidation, where multiple VMs share a physical host, has seen rapid adoption in practice, with increasingly large numbers of VMs per machine and per CPU core. Our investigations, however, suggest that the increasing degree of VM consolidation has serious negative effects on the VMs’ TCP performance. As multiple VMs share a given CPU, the scheduling latencies, which can be on the order of tens of milliseconds, substantially increase the typically submillisecond round-trip times (RTTs) for TCP connections in a datacenter, causing significant degradation in throughput. In this article, we propose a lightweight solution, called vPRO, that (a) offloads the VM’s TCP congestion control function to the driver domain to improve TCP transmit performance; and (b) offloads TCP acknowledgment functionality to the driver domain to improve TCP receive performance. Our evaluation of a vPRO prototype on Xen suggests that vPRO substantially improves TCP receive and transmit throughputs with minimal per-packet CPU overhead. We further show that the higher TCP throughput leads to improvement in application-level performance, via experiments with Apache Olio, a Web 2.0 cloud application, and the Intel MPI benchmark.
ACM Transactions on Computer Systems (TOCS). https://doi.org/10.1145/2491463
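The degradation mechanism follows from TCP's window-limited throughput bound, roughly one window per round trip. A back-of-the-envelope sketch with assumed (not measured) numbers shows why tens of milliseconds of VM scheduling latency dwarf a submillisecond datacenter RTT:

```python
# Window-limited TCP throughput is bounded by roughly window / RTT.
# All numbers below are illustrative assumptions, not the paper's data.
window_bytes = 64 * 1024    # 64 KB congestion/receive window
native_rtt = 0.5e-3         # ~0.5 ms: typical datacenter RTT
sched_delay = 30e-3         # tens of ms of VM scheduling latency

def tput_mbps(rtt: float) -> float:
    return window_bytes * 8 / rtt / 1e6

print(f"native RTT:   {tput_mbps(native_rtt):8.1f} Mbit/s")
print(f"inflated RTT: {tput_mbps(native_rtt + sched_delay):8.1f} Mbit/s")
# The ~60x RTT inflation translates directly into a ~60x throughput
# drop, which is why offloading TCP responsibilities to the frequently
# scheduled driver domain recovers performance.
```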
Spanner
J. Corbett, J. Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. Furman, S. Ghemawat, Andrey Gubarev, Christopher Heiser, P. Hochschild, Wilson C. Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, S. Melnik, David Mwaura, D. Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford
Spanner is Google’s scalable, multiversion, globally distributed, and synchronously replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This article describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free snapshot transactions, and atomic schema changes, across all of Spanner.
ACM Transactions on Computer Systems (TOCS). https://doi.org/10.1145/2491245
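The time API the abstract refers to is TrueTime, which returns an interval guaranteed to contain absolute time; external consistency then rests on commit wait: locks are held until a transaction's commit timestamp is definitely in the past. The sketch below is a toy rendering of that rule, with a fixed uncertainty bound standing in for Spanner's GPS- and atomic-clock-derived bound.

```python
import time
from collections import namedtuple

TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

EPSILON = 0.004  # assumed clock uncertainty bound (seconds); Spanner's
                 # real bound is derived from GPS and atomic-clock
                 # references and varies over time.

def tt_now() -> TTInterval:
    """TrueTime-style interval guaranteed to contain absolute time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit_wait(commit_ts: float) -> None:
    """Block until commit_ts is definitely in the past, i.e. until
    tt_now().earliest > commit_ts. This makes timestamp order match
    real-time commit order (external consistency)."""
    while tt_now().earliest <= commit_ts:
        time.sleep(EPSILON / 2)

# A coordinator would pick commit_ts >= tt_now().latest, then wait:
ts = tt_now().latest
commit_wait(ts)
print("safe to release locks and report commit")
```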
The Next 700 BFT Protocols
R. Guerraoui
We present Abstract (ABortable STate mAChine replicaTion), a new abstraction for designing and reconfiguring generalized replicated state machines that, unlike traditional state machines, are allowed to abort executing a client’s request if “something goes wrong.” Abstract can be used to considerably simplify the incremental development of efficient Byzantine fault-tolerant state machine replication (BFT) protocols, which are notorious for being difficult to develop. In short, we treat a BFT protocol as a composition of Abstract instances. Each instance is developed and analyzed independently and optimized for specific system conditions. We illustrate the power of Abstract through several interesting examples. We first show how Abstract can yield the benefits of a state-of-the-art BFT protocol in a less painful and less error-prone manner. Namely, we develop AZyzzyva, a new protocol that mimics the celebrated best-case behavior of Zyzzyva using less than 35% of the Zyzzyva code. To cover worst-case situations, our abstraction enables one to use any existing BFT protocol in AZyzzyva. We then present Aliph, a new BFT protocol that outperforms previous BFT protocols in terms of both latency (by up to 360%) and throughput (by up to 30%). Finally, we present R-Aliph, a robust implementation of Aliph, that is, one whose performance degrades gracefully in the presence of Byzantine replicas and Byzantine clients.
ACM Transactions on Computer Systems (TOCS). https://doi.org/10.1145/2658994
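The composition pattern described here — run a fast, speculative instance and, on abort, hand the accumulated history to a more robust instance — can be sketched without any real BFT machinery. Everything below (the instance classes, the Abort exception, the switching loop) is hypothetical scaffolding, not the paper's code:

```python
class Abort(Exception):
    """Raised when an Abstract instance cannot make progress
    ('something goes wrong'); carries the history to hand off."""
    def __init__(self, history):
        self.history = history

class SpeculativeInstance:
    """Fast path (Zyzzyva-like): valid only in benign conditions."""
    def __init__(self):
        self.history = []
    def execute(self, request):
        if request == "faulty":      # stand-in for a detected problem
            raise Abort(self.history)
        self.history.append(request)
        return f"fast:{request}"

class BackupInstance:
    """Slow but robust path; initialized from the aborted history."""
    def __init__(self, history):
        self.history = list(history)
    def execute(self, request):
        self.history.append(request)
        return f"robust:{request}"

def replicated_execute(requests):
    instance = SpeculativeInstance()
    for req in requests:
        try:
            yield instance.execute(req)
        except Abort as a:
            instance = BackupInstance(a.history)  # switch instances
            yield instance.execute(req)           # re-execute under backup

print(list(replicated_execute(["a", "b", "faulty", "c"])))
# ['fast:a', 'fast:b', 'robust:faulty', 'robust:c']
```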