Scaling Byzantine Fault Tolerant (BFT) systems in terms of membership is important for secure applications with large participation, such as blockchains. While traditional protocols have low latency, they cannot handle many processors. Conversely, blockchains often have hundreds to thousands of processors to increase robustness, but they typically have high latency or energy costs. We describe various sources of unscalability in BFT consensus protocols. To improve performance, many BFT protocols optimize the “normal case,” where there are no failures. This can be done in a modular fashion by wrapping existing BFT protocols with a building block that we call alliance. In normal-case executions, alliance can scalably determine whether the initial conditions of a BFT consensus protocol predetermine the outcome, obviating the need to run the consensus protocol. We give examples of existing protocols that solve alliance. We show that a solution based on hypercubes and MACs has desirable scalability and performance in normal-case executions, with only a modest overhead otherwise. We provide important optimizations. Finally, we evaluate our solution using the ns-3 simulator, show that it scales to thousands of processors, and compare it with prior work across various network topologies.
{"title":"Scaling Membership of Byzantine Consensus","authors":"Burcu Canakci, R. V. Renesse","doi":"10.1145/3473138","DOIUrl":"https://doi.org/10.1145/3473138","url":null,"abstract":"Scaling Byzantine Fault Tolerant (BFT) systems in terms of membership is important for secure applications with large participation such as blockchains. While traditional protocols have low latency, they cannot handle many processors. Conversely, blockchains often have hundreds to thousands of processors to increase robustness, but they typically have high latency or energy costs. We describe various sources of unscalability in BFT consensus protocols. To improve performance, many BFT protocols optimize the “normal case,” where there are no failures. This can be done in a modular fashion by wrapping existing BFT protocols with a building block that we call alliance. In normal case executions, alliance can scalably determine if the initial conditions of a BFT consensus protocol predetermine the outcome, obviating running the consensus protocol. We give examples of existing protocols that solve alliance. We show that a solution based on hypercubes and MACs has desirable scalability and performance in normal case executions, with only a modest overhead otherwise. We provide important optimizations. Finally, we evaluate our solution using the ns3 simulator and show that it scales up to thousands of processors and compare with prior work in various network topologies.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1 1","pages":"1 - 31"},"PeriodicalIF":1.5,"publicationDate":"2020-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64045185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonas Markussen, Lars Bjørlykke Kristiansen, P. Halvorsen, Halvor Kielland-Gyrud, H. Stensland, C. Griwodz
The large variety of compute-heavy and data-driven applications accelerates the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have various devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMe drives, GPUs, network adapters, and other standard PCIe devices may be borrowed and accessed directly, as if they were local to the borrowing machine. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. Software is entirely removed from the data path, and a device can be shared simultaneously among application processes running on remote hosts. Our experimental results show that I/O devices can be shared with remote hosts, achieving native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance.
{"title":"SmartIO: Zero-overhead Device Sharing through PCIe Networking","authors":"Jonas Markussen, Lars Bjørlykke Kristiansen, P. Halvorsen, Halvor Kielland-Gyrud, H. Stensland, C. Griwodz","doi":"10.1145/3462545","DOIUrl":"https://doi.org/10.1145/3462545","url":null,"abstract":"The large variety of compute-heavy and data-driven applications accelerate the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have various devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMes, GPUs, network adapters, or any other standard PCIe device may be borrowed and accessed directly, as if they were local to the remote machines. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. Software is entirely removed from the data path, and simultaneous sharing of a device among application processes running on remote hosts is enabled. Our experimental results show that I/O devices can be shared with remote hosts, achieving native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"36 1","pages":"2:1-2:78"},"PeriodicalIF":1.5,"publicationDate":"2020-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74968075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer community, and there are commercial versions that are deployed on a large scale and in safety-critical systems. In this article we examine the lessons learnt in those 20 years about microkernel design and implementation. We revisit the L4 design articles and examine the evolution of design and implementation from the original L4 to the latest generation of L4 kernels. We specifically look at seL4, which has pushed the L4 model furthest and was the first OS kernel to undergo a complete formal verification of its implementation as well as a sound analysis of worst-case execution times. We demonstrate that while much has changed, the fundamental principles of minimality, generality, and high inter-process communication (IPC) performance remain the main drivers of design and implementation decisions.
{"title":"L4 Microkernels: The Lessons from 20 Years of Research and Deployment","authors":"G. Heiser, Kevin Elphinstone","doi":"10.1145/2893177","DOIUrl":"https://doi.org/10.1145/2893177","url":null,"abstract":"The L4 microkernel has undergone 20 years of use and evolution. It has an active user and developer community, and there are commercial versions that are deployed on a large scale and in safety-critical systems. In this article we examine the lessons learnt in those 20 years about microkernel design and implementation. We revisit the L4 design articles and examine the evolution of design and implementation from the original L4 to the latest generation of L4 kernels. We specifically look at seL4, which has pushed the L4 model furthest and was the first OS kernel to undergo a complete formal verification of its implementation as well as a sound analysis of worst-case execution times. We demonstrate that while much has changed, the fundamental principles of minimality, generality, and high inter-process communication (IPC) performance remain the main drivers of design and implementation decisions.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"8 1","pages":"1:1-1:29"},"PeriodicalIF":1.5,"publicationDate":"2016-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87484591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Badamo, Jeff Casarona, Minshu Zhao, D. Yeung
To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies. A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques, introducing concurrent RD and private-stack RD profiling, to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. Unfortunately, there has been no work on using RD analysis to optimize CPU performance or power consumption. This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01× power-efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.
{"title":"Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis","authors":"Michael Badamo, Jeff Casarona, Minshu Zhao, D. Yeung","doi":"10.1145/2851503","DOIUrl":"https://doi.org/10.1145/2851503","url":null,"abstract":"To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies.\u0000 A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. Recent work has extended uniprocessor RD techniques-i.e., by introducing concurrent RD and private-stack RD profiling—to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. Unfortunately, there has been no work on using RD analysis to optimize CPU performance or power consumption.\u0000 This article investigates applying multicore RD analysis to identify the most power efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power efficient configuration, or one that is very close to the best. 
In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"307 1","pages":"3:1-3:30"},"PeriodicalIF":1.5,"publicationDate":"2016-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78258573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
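As a rough illustration of the reuse-distance machinery the article builds on (the article's analytical models are more detailed, and the energy constants below are invented placeholders): for a fully associative LRU cache with C blocks, a reference hits exactly when its reuse distance is below C, so a single RD histogram yields miss counts, and hence coarse energy estimates, for every candidate capacity.

```python
# Minimal sketch of an RD-based estimate, not the paper's models. One pass over
# a block-address trace produces a reuse-distance histogram; miss counts for any
# fully associative LRU capacity then follow directly, and a toy energy model
# turns them into a comparable number. Energy constants are placeholders.
from collections import Counter
from math import inf

def reuse_distance_histogram(trace):
    """RD of an access = number of distinct blocks touched since its previous use."""
    hist, stack = Counter(), []          # stack[0] is the most recently used block
    for block in trace:
        if block in stack:
            d = stack.index(block)       # distinct blocks seen since the last use
            stack.remove(block)
        else:
            d = inf                      # cold (compulsory) reference
        hist[d] += 1
        stack.insert(0, block)
    return hist

def misses(hist, capacity):
    """A reference misses in a capacity-block fully associative LRU cache iff RD >= capacity."""
    return sum(count for d, count in hist.items() if d >= capacity)

def energy_estimate(hist, capacity, e_access_nj=0.5, e_miss_nj=10.0):
    """Toy model: every access pays e_access_nj; each miss adds e_miss_nj (placeholders)."""
    total = sum(hist.values())
    return total * e_access_nj + misses(hist, capacity) * e_miss_nj

if __name__ == "__main__":
    trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
    hist = reuse_distance_histogram(trace)
    for cap in (1, 2, 4):
        print(cap, misses(hist, cap), round(energy_estimate(hist, cap), 1))
```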
Johann Hauswald, M. Laurenzano, Yunqi Zhang, Hailong Yang, Yiping Kang, Cheng Li, A. Rovinski, Arjun Khurana, R. Dreslinski, T. Mudge, V. Petrucci, Lingjia Tang, Jason Mars
As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.
{"title":"Designing Future Warehouse-Scale Computers for Sirius, an End-to-End Voice and Vision Personal Assistant","authors":"Johann Hauswald, M. Laurenzano, Yunqi Zhang, Hailong Yang, Yiping Kang, Cheng Li, A. Rovinski, Arjun Khurana, R. Dreslinski, T. Mudge, V. Petrucci, Lingjia Tang, Jason Mars","doi":"10.1145/2870631","DOIUrl":"https://doi.org/10.1145/2870631","url":null,"abstract":"As user demand scales for intelligent personal assistants (IPAs) such as Apple’s Siri, Google’s Google Now, and Microsoft’s Cortana, we are approaching the computational limits of current datacenter (DC) architectures. It is an open question how future server architectures should evolve to enable this emerging class of applications, and the lack of an open-source IPA workload is an obstacle in addressing this question. In this article, we present the design of Sirius, an open end-to-end IPA Web-service application that accepts queries in the form of voice and images, and responds with natural language. We then use this workload to investigate the implications of four points in the design space of future accelerator-based server architectures spanning traditional CPUs, GPUs, manycore throughput co-processors, and FPGAs. To investigate future server designs for Sirius, we decompose Sirius into a suite of eight benchmarks (Sirius Suite) comprising the computationally intensive bottlenecks of Sirius. We port Sirius Suite to a spectrum of accelerator platforms and use the performance and power trade-offs across these platforms to perform a total cost of ownership (TCO) analysis of various server design points. In our study, we find that accelerators are critical for the future scalability of IPA services. Our results show that GPU- and FPGA-accelerated servers improve the query latency on average by 8.5× and 15×, respectively. For a given throughput, GPU- and FPGA-accelerated servers can reduce the TCO of DCs by 2.3× and 1.3×, respectively.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"27 1","pages":"2:1-2:32"},"PeriodicalIF":1.5,"publicationDate":"2016-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78009252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This article proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multilevel cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multilevel phase-change memory cells can be 1.7 × faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.
{"title":"Approximate Storage in Solid-State Memories","authors":"Adrian Sampson, Jacob Nelson, K. Strauss, L. Ceze","doi":"10.1145/2644808","DOIUrl":"https://doi.org/10.1145/2644808","url":null,"abstract":"Memories today expose an all-or-nothing correctness model that incurs significant costs in performance, energy, area, and design complexity. But not all applications need high-precision storage for all of their data structures all of the time. This article proposes mechanisms that enable applications to store data approximately and shows that doing so can improve the performance, lifetime, or density of solid-state memories. We propose two mechanisms. The first allows errors in multilevel cells by reducing the number of programming pulses used to write them. The second mechanism mitigates wear-out failures and extends memory endurance by mapping approximate data onto blocks that have exhausted their hardware error correction resources. Simulations show that reduced-precision writes in multilevel phase-change memory cells can be 1.7 × faster on average and using failed blocks can improve array lifetime by 23% on average with quality loss under 10%.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"58 1","pages":"9:1-9:23"},"PeriodicalIF":1.5,"publicationDate":"2014-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85726553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Samadi, Janghaeng Lee, D. Jamshidi, S. Mahlke, Amir Hormati
Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because of its high performance and because its difficult programming challenges can be alleviated with proper automation. Our approach, SAGE, combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on an NVIDIA GTX 560 GPU.
{"title":"Scaling Performance via Self-Tuning Approximation for Graphics Engines","authors":"M. Samadi, Janghaeng Lee, D. Jamshidi, S. Mahlke, Amir Hormati","doi":"10.1145/2631913","DOIUrl":"https://doi.org/10.1145/2631913","url":null,"abstract":"Approximate computing, where computation accuracy is traded off for better performance or higher data throughput, is one solution that can help data processing keep pace with the current and growing abundance of information. For particular domains, such as multimedia and learning algorithms, approximation is commonly used today. We consider automation to be essential to provide transparent approximation, and we show that larger benefits can be achieved by constructing the approximation techniques to fit the underlying hardware. Our target platform is the GPU because of its high performance capabilities and difficult programming challenges that can be alleviated with proper automation. Our approach—SAGE—combines a static compiler that automatically generates a set of CUDA kernels with varying levels of approximation with a runtime system that iteratively selects among the available kernels to achieve speedup while adhering to a target output quality set by the user. The SAGE compiler employs three optimization techniques to generate approximate kernels that exploit the GPU microarchitecture: selective discarding of atomic operations, data packing, and thread fusion. Across a set of machine learning and image processing kernels, SAGE's approximation yields an average of 2.5× speedup with less than 10% quality loss compared to the accurate execution on a NVIDIA GTX 560 GPU.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"8 1","pages":"7:1-7:29"},"PeriodicalIF":1.5,"publicationDate":"2014-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89207525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lisa Wu, Orestis Polychroniou, R. J. Barker, Martha A. Kim, K. A. Ross
Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency. The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning. For further acceleration, we describe a hardware range partitioner, or HARP; a streaming framework that offers a seamless execution environment for this and other streaming accelerators; and a detailed analysis of a 32 nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.
{"title":"Energy Analysis of Hardware and Software Range Partitioning","authors":"Lisa Wu, Orestis Polychroniou, R. J. Barker, Martha A. Kim, K. A. Ross","doi":"10.1145/2638550","DOIUrl":"https://doi.org/10.1145/2638550","url":null,"abstract":"Data partitioning is a critical operation for manipulating large datasets because it subdivides tasks into pieces that are more amenable to efficient processing. It is often the limiting factor in database performance and represents a significant fraction of the overall runtime of large data queries. This article measures the performance and energy of state-of-the-art software partitioners, and describes and evaluates a hardware range partitioner that further improves efficiency.\u0000 The software implementation is broken into two phases, allowing separate analysis of the partition function computation and data shuffling costs. Although range partitioning is commonly thought to be more expensive than simpler strategies such as hash partitioning, our measurements indicate that careful data movement and optimization of the partition function can allow it to approach the throughput and energy consumption of hash or radix partitioning.\u0000 For further acceleration, we describe a hardware range partitioner, or HARP, a streaming framework that offers a seamless execution environment for this and other streaming accelerators, and a detailed analysis of a 32nm physical design that matches the throughput of four to eight software threads while consuming just 6.9% of the area and 4.3% of the power of a Xeon core in the same technology generation.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"63 1","pages":"8:1-8:24"},"PeriodicalIF":1.5,"publicationDate":"2014-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83055590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU hardware is becoming increasingly general purpose, quickly outgrowing the traditional but constrained GPU-as-coprocessor programming model. To make GPUs easier to program and easier to integrate with existing systems, we propose making the host's file system directly accessible from GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of our approach. For example, we demonstrate a simple self-contained GPU program that searches for a set of strings in the entire tree of Linux kernel source files over seven times faster than an eight-core CPU run.
{"title":"GPUfs: integrating a file system with GPUs","authors":"M. Silberstein, B. Ford, I. Keidar, E. Witchel","doi":"10.1145/2451116.2451169","DOIUrl":"https://doi.org/10.1145/2451116.2451169","url":null,"abstract":"PU hardware is becoming increasingly general purpose, quickly outgrowing the traditional but constrained GPU-as-coprocessor programming model. To make GPUs easier to program and easier to integrate with existing systems, we propose making the host's file system directly accessible from GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adopted to use our file system, demonstrate the feasibility and benefits of our approach. For example, we demonstrate a simple self-contained GPU program which searches for a set of strings in the entire tree of Linux kernel source files over seven times faster than an eight-core CPU run.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"32 1","pages":"1-31"},"PeriodicalIF":1.5,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/2451116.2451169","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64147694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
G. Klein, June Andronick, Kevin Elphinstone, Toby C. Murray, Thomas Sewell, Rafal Kolanski, G. Heiser
We present an in-depth coverage of the comprehensive machine-checked formal verification of seL4, a general-purpose operating system microkernel. We discuss the kernel design we used to make its verification tractable. We then describe the functional correctness proof of the kernel's C implementation and we cover further steps that transform this result into a comprehensive formal verification of the kernel: a formally verified IPC fastpath, a proof that the binary code of the kernel correctly implements the C semantics, a proof of correct access-control enforcement, a proof of information-flow noninterference, a sound worst-case execution time analysis of the binary, and an automatic initialiser for user-level systems that connects kernel-level access-control enforcement with reasoning about system behaviour. We summarise these results and show how they integrate to form a coherent overall analysis, backed by machine-checked, end-to-end theorems. The seL4 microkernel is currently not just the only general-purpose operating system kernel that is fully formally verified to this degree. It is also the only example of formal proof of this scale that is kept current as the requirements, design and implementation of the system evolve over almost a decade. We report on our experience in maintaining this evolving formally verified code base.
{"title":"Comprehensive formal verification of an OS microkernel","authors":"G. Klein, June Andronick, Kevin Elphinstone, Toby C. Murray, Thomas Sewell, Rafal Kolanski, G. Heiser","doi":"10.1145/2560537","DOIUrl":"https://doi.org/10.1145/2560537","url":null,"abstract":"We present an in-depth coverage of the comprehensive machine-checked formal verification of seL4, a general-purpose operating system microkernel.\u0000 We discuss the kernel design we used to make its verification tractable. We then describe the functional correctness proof of the kernel's C implementation and we cover further steps that transform this result into a comprehensive formal verification of the kernel: a formally verified IPC fastpath, a proof that the binary code of the kernel correctly implements the C semantics, a proof of correct access-control enforcement, a proof of information-flow noninterference, a sound worst-case execution time analysis of the binary, and an automatic initialiser for user-level systems that connects kernel-level access-control enforcement with reasoning about system behaviour. We summarise these results and show how they integrate to form a coherent overall analysis, backed by machine-checked, end-to-end theorems.\u0000 The seL4 microkernel is currently not just the only general-purpose operating system kernel that is fully formally verified to this degree. It is also the only example of formal proof of this scale that is kept current as the requirements, design and implementation of the system evolve over almost a decade. We report on our experience in maintaining this evolving formally verified code base.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"1 1","pages":"2:1-2:70"},"PeriodicalIF":1.5,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79896857","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}