David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, Christos Kozyrakis
User-facing, latency-sensitive services, such as web search, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services, since contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore’s law, it becomes important to reclaim this wasted capacity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets its latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.
{"title":"Improving Resource Efficiency at Scale with Heracles","authors":"David Lo, Liqun Cheng, R. Govindaraju, Parthasarathy Ranganathan, Christos Kozyrakis","doi":"10.1145/2882783","DOIUrl":"https://doi.org/10.1145/2882783","url":null,"abstract":"User-facing, latency-sensitive services, such as websearch, underutilize their computing resources during daily periods of low traffic. Reusing those resources for other tasks is rarely done in production services since the contention for shared resources can cause latency spikes that violate the service-level objectives of latency-sensitive tasks. The resulting under-utilization hurts both the affordability and energy efficiency of large-scale datacenters. With the slowdown in technology scaling caused by the sunsetting of Moore’s law, it becomes important to address this opportunity. We present Heracles, a feedback-based controller that enables the safe colocation of best-effort tasks alongside a latency-critical service. Heracles dynamically manages multiple hardware and software isolation mechanisms, such as CPU, memory, and network isolation, to ensure that the latency-sensitive job meets latency targets while maximizing the resources given to best-effort tasks. We evaluate Heracles using production latency-critical and batch workloads from Google and demonstrate average server utilizations of 90% without latency violations across all the load and colocation scenarios that we evaluated.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent work in the field of value prediction (VP) has shown that given an efficient confidence estimation mechanism, prediction validation could be removed from the out-of-order engine and delayed until commit time. As a result, a simple recovery mechanism—pipeline squashing—can be used, whereas the out-of-order engine remains mostly unmodified. Yet, VP and validation at commit time require additional ports on the physical register file, potentially rendering the overall number of ports unbearable. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place, in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until commit time since predictions are validated at commit time. Consequently, a significant number of instructions—10% to 70% in our experiments—can bypass the out-of-order engine, allowing for a reduction of the issue width. This reduction paves the way for a truly practical implementation of VP. Furthermore, since VP in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution architecture, EOLE, is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
{"title":"EOLE","authors":"Arthur Perais, André Seznec","doi":"10.1145/2870632","DOIUrl":"https://doi.org/10.1145/2870632","url":null,"abstract":"Recent work in the field of value prediction (VP) has shown that given an efficient confidence estimation mechanism, prediction validation could be removed from the out-of-order engine and delayed until commit time. As a result, a simple recovery mechanism—pipeline squashing—can be used, whereas the out-of-order engine remains mostly unmodified. Yet, VP and validation at commit time require additional ports on the physical register file, potentially rendering the overall number of ports unbearable. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place, in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until commit time since predictions are validated at commit time. Consequently, a significant number of instructions—10% to 70% in our experiments—can bypass the out-of-order engine, allowing for a reduction of the issue width. This reduction paves the way for a truly practical implementation of VP. Furthermore, since VP in itself usually increases performance, our resulting {Early—Out-of-Order—Late} Execution architecture, EOLE, is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116998315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sheng Li, Hyeontaek Lim, V. Lee, Jung Ho Ahn, Anuj Kalia, M. Kaminsky, D. Andersen, S. O, Sukhan Lee, P. Dubey
Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented data center infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of data centers. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts also showed orders of magnitude improvement over stock memcached. We aim to architect high-performance and efficient KVS platforms, and we start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) (167 MRPS with client-side batching) on a single commodity server. Our system delivers the best performance and energy efficiency (RPS/watt) demonstrated to date over existing KVSs, including the best-published FPGA-based and GPU-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.
{"title":"Full-Stack Architecting to Achieve a Billion-Requests-Per-Second Throughput on a Single Key-Value Store Server Platform","authors":"Sheng Li, Hyeontaek Lim, V. Lee, Jung Ho Ahn, Anuj Kalia, M. Kaminsky, D. Andersen, S. O, Sukhan Lee, P. Dubey","doi":"10.1145/2897393","DOIUrl":"https://doi.org/10.1145/2897393","url":null,"abstract":"Distributed in-memory key-value stores (KVSs), such as memcached, have become a critical data serving layer in modern Internet-oriented data center infrastructure. Their performance and efficiency directly affect the QoS of web services and the efficiency of data centers. Traditionally, these systems have had significant overheads from inefficient network processing, OS kernel involvement, and concurrency control. Two recent research thrusts have focused on improving key-value performance. Hardware-centric research has started to explore specialized platforms including FPGAs for KVSs; results demonstrated an order of magnitude increase in throughput and energy efficiency over stock memcached. Software-centric research revisited the KVS application to address fundamental software bottlenecks and to exploit the full potential of modern commodity hardware; these efforts also showed orders of magnitude improvement over stock memcached. We aim at architecting high-performance and efficient KVS platforms, and start with a rigorous architectural characterization across system stacks over a collection of representative KVS implementations. Our detailed full-system characterization not only identifies the critical hardware/software ingredients for high-performance KVS systems but also leads to guided optimizations atop a recent design to achieve a record-setting throughput of 120 million requests per second (MRPS) (167MRPS with client-side batching) on a single commodity server. Our system delivers the best performance and energy efficiency (RPS/watt) demonstrated to date over existing KVSs including the best-published FPGA-based and GPU-based claims. We craft a set of design principles for future platform architectures, and via detailed simulations demonstrate the capability of achieving a billion RPS with a single server constructed following our principles.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121098138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rakesh Kumar, Alejandro Martínez, Antonio González
Compiler-based static vectorization is used widely to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective in vectorizing traditional array-based applications. However, compilers’ inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits vectorization opportunities. HW/SW codesigned processors provide an excellent opportunity to optimize applications at runtime. The availability of dynamic application behavior at runtime helps in capturing vectorization opportunities generally missed by compilers. This article proposes to complement static vectorization with a speculative dynamic vectorizer in an HW/SW codesigned processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The speculative reordering of memory instructions avoids the need for accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2× performance benefit compared to the static GCC vectorization alone, for SPECFP2006. Furthermore, the speculative dynamic vectorizer is able to vectorize 48% of the loops that ICC failed to vectorize due to conservative dependence analysis in the TSVC benchmark suite. Moreover, the dynamic vectorization scheme is as effective in vectorization of pointer-based applications as for the array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications. Finally, we show that speculation is not merely a luxury but a necessity for runtime vectorization.
{"title":"Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment","authors":"Rakesh Kumar, Alejandro Martínez, Antonio González","doi":"10.1145/2807694","DOIUrl":"https://doi.org/10.1145/2807694","url":null,"abstract":"Compiler-based static vectorization is used widely to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective in vectorizing traditional array-based applications. However, compilers’ inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits vectorization opportunities. HW/SW codesigned processors provide an excellent opportunity to optimize the applications at runtime. The availability of dynamic application behavior at runtime helps in capturing vectorization opportunities generally missed by the compilers. This article proposes to complement the static vectorization with a speculative dynamic vectorizer in an HW/SW codesigned processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The speculative reordering of memory instructions avoids the need for accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2× performance benefit compared to the static GCC vectorization alone, for SPECFP2006. Furthermore, the speculative dynamic vectorizer is able to vectorize 48% of the loops that ICC failed to vectorize due to conservative dependence analysis in the TSVC benchmark suite. Moreover, the dynamic vectorization scheme is as effective in vectorization of pointer-based applications as for the array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications. Furthermore, we show that speculation is not only a luxury but also a necessity for runtime vectorization.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124072725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jean-Pierre Lozi, Florian David, Gaël Thomas, J. Lawall, Gilles Muller
The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an x86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun UltraSPARC T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.
{"title":"Fast and Portable Locking for Multicore Architectures","authors":"Jean-Pierre Lozi, Florian David, Gaël Thomas, J. Lawall, Gilles Muller","doi":"10.1145/2845079","DOIUrl":"https://doi.org/10.1145/2845079","url":null,"abstract":"The scalability of multithreaded applications on current multicore systems is hampered by the performance of lock algorithms, due to the costs of access contention and cache misses. The main contribution presented in this article is a new locking technique, Remote Core Locking (RCL), that aims to accelerate the execution of critical sections in legacy applications on multicore architectures. The idea of RCL is to replace lock acquisitions by optimized remote procedure calls to a dedicated server hardware thread. RCL limits the performance collapse observed with other lock algorithms when many threads try to acquire a lock concurrently and removes the need to transfer lock-protected shared data to the hardware thread acquiring the lock, because such data can typically remain in the server’s cache. Other contributions presented in this article include a profiler that identifies the locks that are the bottlenecks in multithreaded applications and that can thus benefit from RCL, and a reengineering tool that transforms POSIX lock acquisitions into RCL locks. Eighteen applications were used to evaluate RCL: the nine applications of the SPLASH-2 benchmark suite, the seven applications of the Phoenix 2 benchmark suite, Memcached, and Berkeley DB with a TPC-C client. Eight of these applications are unable to scale because of locks and benefit from RCL on an ×86 machine with four AMD Opteron processors and 48 hardware threads. By using RCL instead of Linux POSIX locks, performance is improved by up to 2.5 times on Memcached, and up to 11.6 times on Berkeley DB with the TPC-C client. On a SPARC machine with two Sun Ultrasparc T2+ processors and 128 hardware threads, three applications benefit from RCL. In particular, performance is improved by up to 1.3 times with respect to Solaris POSIX locks on Memcached, and up to 7.9 times on Berkeley DB with the TPC-C client.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131390307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, A. Krishnamurthy, T. Anderson, Timothy Roscoe
Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2 to 5× in latency and 9× in throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.
{"title":"Arrakis","authors":"Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, A. Krishnamurthy, T. Anderson, Timothy Roscoe","doi":"10.1145/2812806","DOIUrl":"https://doi.org/10.1145/2812806","url":null,"abstract":"Recent device hardware trends enable a new approach to the design of network server operating systems. In a traditional operating system, the kernel mediates access to device hardware by server applications to enforce process isolation as well as network and disk security. We have designed and implemented a new operating system, Arrakis, that splits the traditional role of the kernel in two. Applications have direct access to virtualized I/O devices, allowing most I/O operations to skip the kernel entirely, while the kernel is re-engineered to provide network and disk protection without kernel mediation of every operation. We describe the hardware and software changes needed to take advantage of this new abstraction, and we illustrate its power by showing improvements of 2 to 5 × in latency and 9 × throughput for a popular persistent NoSQL store relative to a well-tuned Linux implementation.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125281748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Pellauer, A. Parashar, Michael Adler, Bushra Ahsan, R. Allmon, N. Crago, Kermin Fleming, M. Gambhir, A. Jaleel, T. Krishna, Daniel Lustig, S. Maresh, Vladimir Pavlov, Rachid Rayess, Antonia Zhai, J. Emer
There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of a traditional coherent shared memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
{"title":"Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures","authors":"Michael Pellauer, A. Parashar, Michael Adler, Bushra Ahsan, R. Allmon, N. Crago, Kermin Fleming, M. Gambhir, A. Jaleel, T. Krishna, Daniel Lustig, S. Maresh, Vladimir Pavlov, Rachid Rayess, Antonia Zhai, J. Emer","doi":"10.1145/2754930","DOIUrl":"https://doi.org/10.1145/2754930","url":null,"abstract":"There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8 × greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0 ×.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131755497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data-parallel work by taking advantage of its massive number of cores while the CPU handles non-data-parallel work, such as the sequential code or data transfer management. This work distribution can be a poor solution as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance-competitive with GPUs on many workloads; thus, simply partitioning work based on fixed roles may be a poor choice. In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 28% on a system with one multicore CPU and two asymmetric GPUs compared to a fastest-device execution strategy for a set of popular OpenCL kernels.
{"title":"SKMD","authors":"Janghaeng Lee, M. Samadi, Yongjun Park, S. Mahlke","doi":"10.1145/2798725","DOIUrl":"https://doi.org/10.1145/2798725","url":null,"abstract":"Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data parallel work by taking advantage of its massive number of cores while the CPU handles non data-parallel work, such as the sequential code or data transfer management. This work distribution can be a poor solution as it underutilizes the CPU, has difficulty generalizing beyond the single CPU-GPU combination, and may waste a large fraction of time transferring data. Further, CPUs are performance competitive with GPUs on many workloads, thus simply partitioning work based on the fixed roles may be a poor choice. In this article, we present the single-kernel multiple devices (SKMD) system, a framework that transparently orchestrates collaborative execution of a single data-parallel kernel across multiple asymmetric CPUs and GPUs. The programmer is responsible for developing a single data-parallel kernel in OpenCL, while the system automatically partitions the workload across an arbitrary set of devices, generates kernels to execute the partial workloads, and efficiently merges the partial outputs together. The goal is performance improvement by maximally utilizing all available resources to execute the kernel. SKMD handles the difficult challenges of exposed data transfer costs and the performance variations GPUs have with respect to input size. On real hardware, SKMD achieves an average speedup of 28% on a system with one multicore CPU and two asymmetric GPUs compared to a fastest device execution strategy for a set of popular OpenCL kernels.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127900009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, Collin Lee, Behnam Montazeri, Diego Ongaro, S. Park, Henry Qin, M. Rosenblum, Stephen M. Rumble, Ryan Stutsman, Stephen Yang
RAMCloud is a storage system that provides low-latency access to large-scale datasets. To achieve low latency, RAMCloud stores all data in DRAM at all times. To support large capacities (1PB or more), it aggregates the memories of thousands of servers into a single coherent key-value store. RAMCloud ensures the durability of DRAM-based data by keeping backup copies on secondary storage. It uses a uniform log-structured mechanism to manage both DRAM and secondary storage, which results in high performance and efficient memory usage. RAMCloud uses a polling-based approach to communication, bypassing the kernel to communicate directly with NICs; with this approach, client applications can read small objects from any RAMCloud storage server in less than 5μs, and durable writes of small objects take about 13.5μs. RAMCloud does not keep multiple copies of data online; instead, it provides high availability by recovering from crashes very quickly (1 to 2 seconds). RAMCloud’s crash recovery mechanism harnesses the resources of the entire cluster working concurrently so that recovery performance scales with cluster size.
{"title":"The RAMCloud Storage System","authors":"J. Ousterhout, A. Gopalan, A. Gupta, Kejriwal A., Collin Lee, Behnam Montazeri, Diego Ongaro, S. Park, Henry Qin, M. Rosenblum, Stephen M. Rumble, Ryan Stutsman, Stephen Yang","doi":"10.1145/2806887","DOIUrl":"https://doi.org/10.1145/2806887","url":null,"abstract":"RAMCloud is a storage system that provides low-latency access to large-scale datasets. To achieve low latency, RAMCloud stores all data in DRAM at all times. To support large capacities (1PB or more), it aggregates the memories of thousands of servers into a single coherent key-value store. RAMCloud ensures the durability of DRAM-based data by keeping backup copies on secondary storage. It uses a uniform log-structured mechanism to manage both DRAM and secondary storage, which results in high performance and efficient memory usage. RAMCloud uses a polling-based approach to communication, bypassing the kernel to communicate directly with NICs; with this approach, client applications can read small objects from any RAMCloud storage server in less than 5μs, durable writes of small objects take about 13.5μs. RAMCloud does not keep multiple copies of data online; instead, it provides high availability by recovering from crashes very quickly (1 to 2 seconds). RAMCloud’s crash recovery mechanism harnesses the resources of the entire cluster working concurrently so that recovery performance scales with cluster size.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"131 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124251671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile systems-on-chip (SoCs) that incorporate heterogeneous coherence domains promise high energy efficiency for a wide range of mobile applications, yet are difficult to program. To exploit the architecture, a desirable, yet missing capability is to replicate operating system (OS) services over multiple coherence domains with minimum inter-domain communication. In designing such an OS, we set three goals: to ease application development, to simplify OS engineering, and to preserve the current OS performance. To this end, we identify a shared-most OS model for multiple coherence domains: creating per-domain instances of core OS services with no shared state, while enabling other extended OS services to share state across domains. To test the model, we build K2, a prototype OS on the TI OMAP4 SoC, by reusing most of the Linux 3.4 source. K2 presents a single system image to applications with its two kernels running on top of the two coherence domains of OMAP4. The two kernels have independent instances of core OS services, such as page allocation and interrupt management, as coordinated by K2; the two kernels share most extended OS services, such as device drivers, whose state is kept coherent transparently by K2. Despite platform constraints and unoptimized code, K2 improves energy efficiency for light OS workloads by 8x to 10x, while incurring less than 9% performance overhead for two device drivers shared between kernels. Our experiences with K2 show that the shared-most model is promising.
{"title":"K2","authors":"F. Lin, Zhen Wang, Lin Zhong","doi":"10.1145/2699676","DOIUrl":"https://doi.org/10.1145/2699676","url":null,"abstract":"Mobile System-on-Chips (SoC) that incorporate heterogeneous coherence domains promise high energy efficiency to a wide range of mobile applications, yet are difficult to program. To exploit the architecture, a desirable, yet missing capability is to replicate operating system (OS) services over multiple coherence domains with minimum inter-domain communication. In designing such an OS, we set three goals: to ease application development, to simplify OS engineering, and to preserve the current OS performance. To this end, we identify a shared-most OS model for multiple coherence domains: creating per-domain instances of core OS services with no shared state, while enabling other extended OS services to share state across domains. To test the model, we build K2, a prototype OS on the TI OMAP4 SoC, by reusing most of the Linux 3.4 source. K2 presents a single system image to applications with its two kernels running on top of the two coherence domains of OMAP4. The two kernels have independent instances of core OS services, such as page allocation and interrupt management, as coordinated by K2; the two kernels share most extended OS services, such as device drivers, whose state is kept coherent transparently by K2. Despite platform constraints and unoptimized code, K2 improves energy efficiency for light OS workloads by 8x-10x, while incurring less than 9% performance overhead for two device drivers shared between kernels. Our experiences with K2 show that the shared-most model is promising.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131749728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}