"Introduction to the Special Issue on the Award Papers of USENIX ATC 2019" by D. Malkhi and Dan Tsafrir. DOI: https://doi.org/10.1145/3395034

This special issue of ACM Transactions on Computer Systems presents the three papers from the 2019 USENIX Annual Technical Conference (ATC'19) that won the Best Paper Award. The scope of ATC is broad. It covers all practical aspects related to systems software, and its goal is to improve and further the knowledge of computing systems of all scales, from small embedded devices to large data centers, while emphasizing implementations and experimental results. ATC underwent significant changes and improvements in 2019. ATC'19 received 356 submissions and accepted 71 (nearly 20%) through a double-blind, two-round review process in which each round-2 submission was reviewed by five to six program committee (PC) members.

After the ATC'19 program was finalized, the Best Paper Award selection process proceeded in two phases. In the first phase, we combined several signals. One was an explicit ranking by reviewers marking papers worthy of consideration for the best paper; any paper marked for such consideration by two or more PC members was passed to the second phase. Additionally, we considered general review ranks and deliberations (both online and during the PC meeting), moving several additional top-ranking papers to the second phase. Last, we collected explicit nominations by PC members for the Best Paper Award. At the end of the first phase, we had generated a shortlist of eight candidate papers. At this stage, we appointed a team of six PC members consisting of senior and experienced members of the systems research community. Over a period of four weeks, the team read the papers, and we discussed each separately for Best Paper worthiness. We did not place a quota on the number of Best Paper Awards. Generally, the committee favored papers with original or surprising contributions, and/or ones that would spark interest and establish a new direction for follow-on work. At the end of the second phase, we selected three papers to receive the ATC'19 Best Paper Award; they are presented in this special issue. All three works include additional material relative to their conference versions; this material has been reviewed (in "fast-track mode") by one or two of the original ATC'19 reviewers.

The first work is "SILK+: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores Running Heterogeneous Workloads" by Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona, which introduces techniques that lower 99th-percentile latencies by up to two orders of magnitude relative to common log-structured merge key-value stores. The second work is "Transactuations: Where Transactions Meet the Physical World" by Aritra Sengupta, Tanakorn Leesatapornwongsa, Masoud Saeida Ardekani, and Cesar A. Stuardo, which uncovers IoT application inconsistencies that manifest under failures and proposes a useful abstraction and execution model to combat this problem.
{"title":"Introduction to the Special Issue on the Award Papers of USENIX ATC 2019","authors":"D. Malkhi, Dan Tsafrir","doi":"10.1145/3395034","DOIUrl":"https://doi.org/10.1145/3395034","url":null,"abstract":"This special issue of ACM Transactions on Computer Systems presents the three papers from the 2019 USENIX Annual Technical Conference (ATC’19) that won the Best Paper Award. The scope of ATC is broad. It covers all practical aspects related to systems software, and its goal is to improve and further the knowledge of computing systems of all scales, from small embedded devices to large data centers, while emphasizing implementations and experimental results. ATC underwent significant changes and improvements in 2019. ATC’19 received 356 submissions and accepted 71 (nearly 20%) through a double-blind, tworound review process in which each round 2 submission was reviewed by 5 to 6 program committee (PC) members. After the ATC’19 program was finalized, the Best Paper Award selection process proceeded in two phases. In the first phase, we combined several signals. One was an explicit ranking by reviewers marking papers worthy of consideration for the best paper, and any paper marked for such consideration by two or more PC members was passed to the second phase. Additionally, we considered general review ranks and deliberations (both online and during the PC meeting), moving several additional top-ranking papers to the second phase. Last, we collected explicit nominations by PC members for the Best Paper Award. At the end of the first phase, we generated a short-list of eight candidate papers. At this stage, we appointed a team of six PC members consisting of senior and experienced members of the systems research community. During a period of 4 weeks, the team read the papers, and we discussed each separately for best paper worthiness. We did not place a quota on the number of Best Paper Awards. Generally, the committee favored papers with original or surprising contributions, and/or ones that would spark interest and establish a new direction for follow-on works. At the end of the second stage, we elected three papers to receive the ATC’19 Best Paper Award, which are presented in this special issue. All three works include additional material relative to their conference version, which has been reviewed (in “fast-track mode”) by one or two of the original ATC’19 reviewers. The first work is “SILK+: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores Running Heterogeneous Workloads” by Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona, which introduces techniques that manage to lower the 99th percentile latencies by up to two orders of magnitude relative to common log-structured merge key-value stores. The second work is “Transactuations: Where Transactions Meet the Physical World” by Aritra Sengupta, Tanakorn Leesatapornwongsa, Masoud Saeida Ardekani, and Cesar A. 
Stuardo, which uncovers IoT application inconsistencies manifesting out of failures and proposes a useful abstraction and execution model to combat this problem.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115850800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Effective Detection of Sleep-in-atomic-context Bugs in the Linux Kernel" by Jia-Ju Bai, J. Lawall, and Shimin Hu. DOI: https://doi.org/10.1145/3381990

Atomic context is an execution state of the Linux kernel in which kernel code monopolizes a CPU core. In this state, the Linux kernel may only perform operations that cannot sleep, as otherwise a system hang or crash may occur. We refer to this kind of concurrency bug as a sleep-in-atomic-context (SAC) bug. In practice, SAC bugs are hard to find, as they do not cause problems in all executions. In this article, we propose a practical static approach named DSAC to effectively detect SAC bugs in the Linux kernel. DSAC uses three key techniques: (1) a summary-based analysis to identify the code that may be executed in atomic context, (2) a connection-based alias analysis to identify the set of functions referenced by a function pointer, and (3) a path-check method to filter out repeated reports and false bugs. We evaluate DSAC on Linux 4.17 and find 1,159 SAC bugs. We manually check all the bugs and find that 1,068 are real. We randomly selected 300 of the real bugs and sent them to kernel developers; 220 of these bugs have been confirmed, and 51 of our patches, fixing 115 bugs, have been applied.
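To make the bug class concrete, here is a minimal kernel-style sketch (our illustration, not code from the paper): taking a spinlock puts the core in atomic context, so a potentially sleeping allocation inside the critical section is a SAC bug, and switching to a non-sleeping allocation flag is one common fix.

    #include <linux/spinlock.h>
    #include <linux/slab.h>

    static DEFINE_SPINLOCK(dev_lock);

    /* SAC bug: kmalloc(GFP_KERNEL) may sleep, but the held spinlock keeps
     * us in atomic context, so this call can hang or crash the system. */
    static void *buggy_alloc(size_t n)
    {
            void *buf;

            spin_lock(&dev_lock);
            buf = kmalloc(n, GFP_KERNEL);   /* may sleep: bug */
            spin_unlock(&dev_lock);
            return buf;
    }

    /* Common fix: GFP_ATOMIC never sleeps (it may fail instead). */
    static void *fixed_alloc(size_t n)
    {
            void *buf;

            spin_lock(&dev_lock);
            buf = kmalloc(n, GFP_ATOMIC);   /* safe in atomic context */
            spin_unlock(&dev_lock);
            return buf;
    }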
{"title":"Effective Detection of Sleep-in-atomic-context Bugs in the Linux Kernel","authors":"Jia-Ju Bai, J. Lawall, Shimin Hu","doi":"10.1145/3381990","DOIUrl":"https://doi.org/10.1145/3381990","url":null,"abstract":"Atomic context is an execution state of the Linux kernel in which kernel code monopolizes a CPU core. In this state, the Linux kernel may only perform operations that cannot sleep, as otherwise a system hang or crash may occur. We refer to this kind of concurrency bug as a sleep-in-atomic-context (SAC) bug. In practice, SAC bugs are hard to find, as they do not cause problems in all executions. In this article, we propose a practical static approach named DSAC to effectively detect SAC bugs in the Linux kernel. DSAC uses three key techniques: (1) a summary-based analysis to identify the code that may be executed in atomic context, (2) a connection-based alias analysis to identify the set of functions referenced by a function pointer, and (3) a path-check method to filter out repeated reports and false bugs. We evaluate DSAC on Linux 4.17 and find 1,159 SAC bugs. We manually check all the bugs and find that 1,068 bugs are real. We have randomly selected 300 of the real bugs and sent them to kernel developers. 220 of these bugs have been confirmed, and 51 of our patches fixing 115 bugs have been applied.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117198255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"An Instruction Set Architecture for Machine Learning" by Yunji Chen, Huiying Lan, Zidong Du, Shaoli Liu, Jinhua Tao, D. Han, Tao Luo, Qi Guo, Ling Li, Yuan Xie, and Tianshi Chen. DOI: https://doi.org/10.1145/3331469
Machine Learning (ML) is a family of models for learning from data to improve performance on a certain task. ML techniques, especially the recently resurgent neural networks (deep neural networks), have proven to be efficient for a broad range of applications. ML techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which usually are not energy efficient, since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of ML techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) that directly correspond to high-level functional blocks of an ML technique (such as layers in neural networks) or even to an ML technique as a whole. Although straightforward and easy to implement for a limited set of similar ML techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. In this article, we first propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. We then extend the application scope of Cambricon from NN to ML techniques. We also propose an assembly language, an assembler, and a runtime to support programming with Cambricon, especially targeting large-scale ML problems. Our evaluation over a total of 16 representative yet distinct ML techniques has demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [7] (which can only accommodate three types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks and 7 other ML benchmarks. Compared to the recent prevalent ML accelerator PuDianNao, our Cambricon-based accelerator is able to support all the ML techniques as well as the 10 NNs, but with only approximately 5.1% performance loss.
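As a rough illustration of what one Cambricon-class matrix instruction computes (our sketch; the actual mnemonics, encodings, and scratchpad addressing are not given in the abstract), the C function below is the reference semantics of a matrix-vector multiply, y = W * x. On the accelerator, this entire loop nest would collapse into a single matrix instruction over on-chip data, which is where much of the claimed code-density gain over general-purpose ISAs comes from.

    #include <stddef.h>

    /* Reference semantics of a matrix-vector-multiply instruction:
     * y = W * x, with W a rows-by-cols matrix stored row-major. In a
     * matrix ISA this is one instruction, not a two-level loop nest. */
    void matrix_mult_vector(float *y, const float *w, const float *x,
                            size_t rows, size_t cols)
    {
            for (size_t i = 0; i < rows; i++) {
                    float acc = 0.0f;
                    for (size_t j = 0; j < cols; j++)
                            acc += w[i * cols + j] * x[j];
                    y[i] = acc;
            }
    }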
{"title":"An Instruction Set Architecture for Machine Learning","authors":"Yunji Chen, Huiying Lan, Zidong Du, Shaoli Liu, Jinhua Tao, D. Han, Tao Luo, Qi Guo, Ling Li, Yuan Xie, Tianshi Chen","doi":"10.1145/3331469","DOIUrl":"https://doi.org/10.1145/3331469","url":null,"abstract":"Machine Learning (ML) are a family of models for learning from the data to improve performance on a certain task. ML techniques, especially recent renewed neural networks (deep neural networks), have proven to be efficient for a broad range of applications. ML techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which usually are not energy efficient, since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of ML techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an ML technique (such as layers in neural networks) or even an ML as a whole. Although straightforward and easy to implement for a limited set of similar ML techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. In this article, we first propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. We then extend the application scope of Cambricon from NN to ML techniques. We also propose an assembly language, an assembler, and runtime to support programming with Cambricon, especially targeting large-scale ML problems. Our evaluation over a total of 16 representative yet distinct ML techniques have demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [7] (which can only accommodate three types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks and 7 other ML benchmarks. Compared to the recent prevalent ML accelerator PuDianNao, our Cambricon-based accelerator is able to support all the ML techniques as well as the 10 NNs but with only approximate 5.1% performance loss.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"468 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130201592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"A Retargetable System-level DBT Hypervisor" by T. Spink, Harry Wagstaff, and B. Franke. DOI: https://doi.org/10.1145/3386161

System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different from that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new architecture or extending an existing architecture are high. In this article, we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks and standard application benchmarks, and we demonstrate its ability to outperform the de facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA while delivering outstanding performance.
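The core mechanism shared by system-level DBTs such as this one and QEMU can be sketched as a translation-cache-driven execution loop. This is a hypothetical skeleton: jit_translate stands in for the paper's generated guest-specific modules plus JIT backend, and real systems also handle interrupts, exceptions, and self-modifying code inside this loop.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t guest_addr_t;
    /* A translated block runs natively and returns the next guest PC. */
    typedef guest_addr_t (*host_block_fn)(void *cpu_state);

    #define TCACHE_SIZE 4096
    static struct { guest_addr_t pc; host_block_fn fn; } tcache[TCACHE_SIZE];

    /* Assumed JIT backend: compiles the guest basic block at `pc`. */
    host_block_fn jit_translate(guest_addr_t pc);

    void dbt_run(void *cpu_state, guest_addr_t pc)
    {
            for (;;) {
                    size_t slot = pc % TCACHE_SIZE;        /* direct-mapped cache */
                    if (tcache[slot].pc != pc || !tcache[slot].fn) {
                            tcache[slot].pc = pc;
                            tcache[slot].fn = jit_translate(pc); /* miss: translate */
                    }
                    pc = tcache[slot].fn(cpu_state);       /* hit: run natively */
            }
    }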
{"title":"A Retargetable System-level DBT Hypervisor","authors":"T. Spink, Harry Wagstaff, B. Franke","doi":"10.1145/3386161","DOIUrl":"https://doi.org/10.1145/3386161","url":null,"abstract":"System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different from that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new architecture or extending an existing architecture are high. In this article, we develop a novel, retargetable DBT hypervisor, which includes guest-specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA while delivering outstanding performance.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127362836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"The Arm Triple Core Lock-Step (TCLS) Processor" by X. Iturbe, Balaji Venu, Emre Ozer, Jean-Luc Poupat, Gregoire Gimenez, and Hans-Ulrich Zurek. DOI: https://doi.org/10.1145/3323917

The Arm Triple Core Lock-Step (TCLS) architecture is the natural evolution of Arm Cortex-R Dual Core Lock-Step (DCLS) processors to increase dependability, predictability, and availability in safety-critical and ultra-reliable applications. TCLS is simple, scalable, and easy to deploy in applications where Arm DCLS processors are widely used (e.g., automotive), as well as in new sectors where the presence of Arm technology is incipient (e.g., enterprise) or almost non-existent (e.g., space). Specifically in space, COTS Arm processors provide optimal power-to-performance, extensibility, evolvability, software availability, and ease of use, especially in comparison with the decades-old rad-hard computing solutions that are still in use. This article discusses the fundamentals of an Arm Cortex-R5 based TCLS processor, providing key functioning and implementation details. The article shows that the TCLS architecture keeps the use of rad-hard technology to a minimum, using rad-hard-by-design standard cell libraries only to protect the critical parts, which account for less than 4% of the entire TCLS solution. Moreover, when exposure to radiation is relatively low, such as in terrestrial applications or even in satellites operating in Low Earth Orbit (LEO), the system could be implemented entirely with commercial cell libraries, relying on the radiation mitigation methods implemented in the TCLS to cope with sporadic soft errors in its critical parts. The TCLS solution thus significantly reduces chip manufacturing costs and keeps pace with advances in low power consumption and high-density integration by leveraging commercial semiconductor processes, while matching the reliability levels, and improving on the availability, achievable with extremely expensive rad-hard semiconductor processes. Finally, the article describes a TRL4 proof-of-concept TCLS-based System-on-Chip (SoC) that has been prototyped and tested to power the computer on board an Airbus Defence and Space telecom satellite. When compared to the processor solution currently used by Airbus, the TCLS-based SoC results in a more than 5× performance increase and cuts power consumption by more than half.
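The essential step up from DCLS is that two lock-stepped cores can only detect a divergence, whereas three can also out-vote the faulty one and keep running, which is where the availability gain comes from. The voting logic itself is tiny; below is a conceptual bitwise 2-of-3 majority (in TCLS this is dedicated hardware, and the C only illustrates the logic):

    #include <stdint.h>

    /* Bitwise 2-of-3 majority vote: each output bit takes the value that
     * at least two of the three lock-stepped cores agree on, masking a
     * single faulty core so execution can continue uninterrupted. */
    static inline uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
    {
            return (a & b) | (a & c) | (b & c);
    }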
{"title":"The Arm Triple Core Lock-Step (TCLS) Processor","authors":"X. Iturbe, Balaji Venu, Emre Ozer, Jean-Luc Poupat, Gregoire Gimenez, Hans-Ulrich Zurek","doi":"10.1145/3323917","DOIUrl":"https://doi.org/10.1145/3323917","url":null,"abstract":"The Arm Triple Core Lock-Step (TCLS) architecture is the natural evolution of Arm Cortex-R Dual Core Lock-Step (DCLS) processors to increase dependability, predictability, and availability in safety-critical and ultra-reliable applications. TCLS is simple, scalable, and easy to deploy in applications where Arm DCLS processors are widely used (e.g., automotive), as well as in new sectors where the presence of Arm technology is incipient (e.g., enterprise) or almost non-existent (e.g., space). Specifically in space, COTS Arm processors provide optimal power-to-performance, extensibility, evolvability, software availability, and ease of use, especially in comparison with the decades old rad-hard computing solutions that are still in use. This article discusses the fundamentals of an Arm Cortex-R5 based TCLS processor, providing key functioning and implementation details. The article shows that the TCLS architecture keeps the use of rad-hard technology to a minimum, namely, using rad-hard by design standard cell libraries only to protect the critical parts that account for less than 4% of the entire TCLS solution. Moreover, when exposure to radiation is relatively low, such as in terrestrial applications or even satellites operating in Low Earth Orbits (LEO), the system could be implemented entirely using commercial cell libraries, relying on the radiation mitigation methods implemented on the TCLS to cope with sporadic soft errors in its critical parts. The TCLS solution allows thus to significantly reduce chip manufacturing costs and keep pace with advances in low power consumption and high density integration by leveraging commercial semiconductor processes, while matching the reliability levels and improving availability that can be achieved using extremely expensive rad-hard semiconductor processes. Finally, the article describes a TRL4 proof-of-concept TCLS-based System-on-Chip (SoC) that has been prototyped and tested to power the computer on-board an Airbus Defence and Space telecom satellite. When compared to the currently used processor solution by Airbus, the TCLS-based SoC results in a more than 5× performance increase and cuts power consumption by more than half.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122521491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling" by Stanko Novakovic, Alexandros Daglis, Dmitrii Ustiugov, Edouard Bugnion, B. Falsafi, and Boris Grot. DOI: https://doi.org/10.1145/3309986
To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads. In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack’s micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes—RackOut_static—and another one based on an adaptive load balancing mechanism—RackOut_adaptive. Our results show that RackOut_static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut_adaptive improves the throughput by 30% for workloads with 20% of writes over RackOut_static.
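A minimal sketch of the resulting read path, using hypothetical names and following the abstract's description of RackOut_static: a key still maps to one micro-shard, but any node in the owning rack can serve the read via a one-sided remote access, so a hot key's load is spread over the whole rack rather than concentrated on one server.

    #include <stdint.h>
    #include <stdlib.h>

    #define NODES_PER_RACK 8

    struct rack { int node_ids[NODES_PER_RACK]; };

    /* Keys map to micro-shards; each micro-shard has one owning rack. */
    uint64_t micro_shard_of(uint64_t key_hash, uint64_t num_shards)
    {
            return key_hash % num_shards;
    }

    /* RackOut_static: serve the read from a uniformly random node of the
     * owning rack (via a one-sided remote read if the data is remote),
     * spreading a skewed micro-shard's load across all rack nodes. */
    int pick_read_node(const struct rack *owner)
    {
            return owner->node_ids[rand() % NODES_PER_RACK];
    }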
{"title":"Mitigating Load Imbalance in Distributed Data Serving with Rack-Scale Memory Pooling","authors":"Stanko Novakovic, Alexandros Daglis, Dmitrii Ustiugov, Edouard Bugnion, B. Falsafi, Boris Grot","doi":"10.1145/3309986","DOIUrl":"https://doi.org/10.1145/3309986","url":null,"abstract":"To provide low-latency and high-throughput guarantees, most large key-value stores keep the data in the memory of many servers. Despite the natural parallelism across lookups, the load imbalance, introduced by heavy skew in the popularity distribution of keys, limits performance. To avoid violating tail latency service-level objectives, systems tend to keep server utilization low and organize the data in micro-shards, which provides units of migration and replication for the purpose of load balancing. These techniques reduce the skew but incur additional monitoring, data replication, and consistency maintenance overheads. In this work, we introduce RackOut, a memory pooling technique that leverages the one-sided remote read primitive of emerging rack-scale systems to mitigate load imbalance while respecting service-level objectives. In RackOut, the data are aggregated at rack-scale granularity, with all of the participating servers in the rack jointly servicing all of the rack’s micro-shards. We develop a queuing model to evaluate the impact of RackOut at the datacenter scale. In addition, we implement a RackOut proof-of-concept key-value store, evaluate it on two experimental platforms based on RDMA and Scale-Out NUMA, and use these results to validate the model. We devise two distinct approaches to load balancing within a RackOut unit, one based on random selection of nodes—RackOut_static—and another one based on an adaptive load balancing mechanism—RackOut_adaptive. Our results show that RackOut_static increases throughput by up to 6× for RDMA and 8.6× for Scale-Out NUMA compared to a scale-out deployment, while respecting tight tail latency service-level objectives. RackOut_adaptive improves the throughput by 30% for workloads with 20% of writes over RackOut_static.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132492018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"SPIN" by Shai Bergman, Tanya Brokhman, Tzachi Cohen, and M. Silberstein. DOI: https://doi.org/10.1145/3309987
Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces to manually handle the subtleties of data consistency and misaligned accesses. We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses and re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID. We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for the highly optimized GPU-accelerated image collage with only 30 changed lines of code.
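For a feel of the burden SPIN lifts, even the CPU-side analogue of bypassing the OS file I/O layers, Linux O_DIRECT, already pushes alignment bookkeeping onto the developer. A minimal sketch follows; the 4 KiB alignment is an assumption, since the required granularity depends on the device and filesystem:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            if (argc < 2)
                    return 1;

            int fd = open(argv[1], O_RDONLY | O_DIRECT); /* bypass page cache */
            if (fd < 0) { perror("open"); return 1; }

            void *buf;
            size_t len = 1 << 20;                 /* multiple of 4 KiB */
            if (posix_memalign(&buf, 4096, len))  /* buffer must be aligned */
                    return 1;

            ssize_t n = pread(fd, buf, len, 0);   /* offset must be aligned too */
            printf("read %zd bytes\n", n);

            free(buf);
            close(fd);
            return 0;
    }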
{"title":"SPIN","authors":"Shai Bergman, Tanya Brokhman, Tzachi Cohen, M. Silberstein","doi":"10.1145/3309987","DOIUrl":"https://doi.org/10.1145/3309987","url":null,"abstract":"Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces to manually handle the subtleties of data consistency and misaligned accesses. We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID. We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for the highly optimized GPU-accelerated image collage with only 30 changed lines of code.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"1351 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123362089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Derecho" by Sagar Jha, J. Behrens, Theo Gkountouvas, Mae Milano, Weijia Song, E. Tremel, R. V. Renesse, Sydney Zink, and K. Birman. DOI: https://doi.org/10.1145/3302258
Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard, with throughput peaking at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly tuned standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.
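The state-machine-replication building block Derecho provides within each subgroup or shard boils down to the discipline sketched below (a conceptual C sketch, not Derecho's actual C++ API): every replica applies the same deterministic updates in the same delivery order, so replica states cannot diverge.

    #include <stdint.h>

    /* One replica of a trivially simple replicated state machine. */
    struct replica {
            int64_t  balance;   /* the replicated state */
            uint64_t next_seq;  /* next update we expect to apply */
    };

    /* Called when the group's ordering layer delivers update `seq` to this
     * replica. Because every replica applies the same deterministic
     * transition in the same total order, all replicas stay identical. */
    int apply_update(struct replica *r, uint64_t seq, int64_t delta)
    {
            if (seq != r->next_seq)
                    return -1;          /* must apply strictly in order */
            r->balance += delta;        /* deterministic transition */
            r->next_seq++;
            return 0;
    }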
{"title":"Derecho","authors":"Sagar Jha, J. Behrens, Theo Gkountouvas, Mae Milano, Weijia Song, E. Tremel, R. V. Renesse, Sydney Zink, K. Birman","doi":"10.1145/3302258","DOIUrl":"https://doi.org/10.1145/3302258","url":null,"abstract":"Cloud computing services often replicate data and may require ways to coordinate distributed actions. Here we present Derecho, a library for such tasks. The API provides interfaces for structuring applications into patterns of subgroups and shards, supports state machine replication within them, and includes mechanisms that assist in restart after failures. Running over 100Gbps RDMA, Derecho can send millions of events per second in each subgroup or shard and throughput peaks at 16GB/s, substantially outperforming prior solutions. Configured to run purely on TCP, Derecho is still substantially faster than comparable widely used, highly-tuned, standard tools. The key insight is that on modern hardware (including non-RDMA networks), data-intensive protocols should be built from non-blocking data-flow components.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116272833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Deca" by Xuanhua Shi, Zhixiang Ke, Yongluan Zhou, Hai Jin, Lu Lu, Xiong Zhang, Ligang He, Zhenyu Hu, and Fei Wang. DOI: https://doi.org/10.1145/3310361
In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetimes of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared to Spark, Deca is able to (1) reduce the garbage collection time by up to 99.9%, (2) reduce the memory consumption by up to 46.6% and the storage space by 23.4%, (3) achieve 1.2× to 22.7× speedup in terms of execution time in cases without data spilling and 16× to 41.6× speedup in cases with data spilling, and (4) provide similar performance compared to domain-specific systems.
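The underlying idea maps onto classic region (arena) allocation, sketched here in C terms; this is our illustration, since Deca itself does the equivalent transparently inside the JVM by packing same-lifetime objects into byte arrays. Objects sharing a lifetime live in one block with no per-object bookkeeping, and the block is released in a single step when the lifetime ends, leaving nothing for the garbage collector to trace.

    #include <stdlib.h>

    /* A region holds objects with the same inferred lifetime. */
    struct region { char *base; size_t used, cap; };

    int region_init(struct region *r, size_t cap)
    {
            r->base = malloc(cap);
            r->used = 0;
            r->cap = cap;
            return r->base ? 0 : -1;
    }

    /* Bump allocation: no per-object header, no per-object free. */
    void *region_alloc(struct region *r, size_t n)
    {
            if (r->used + n > r->cap)
                    return NULL;
            void *p = r->base + r->used;
            r->used += n;
            return p;
    }

    /* Lifetime ends: release every object in the region at once. */
    void region_release(struct region *r)
    {
            free(r->base);
            r->base = NULL;
            r->used = r->cap = 0;
    }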
{"title":"Deca","authors":"Xuanhua Shi, Zhixiang Ke, Yongluan Zhou, Hai Jin, Lu Lu, Xiong Zhang, Ligang He, Zhenyu Hu, Fei Wang","doi":"10.1145/3310361","DOIUrl":"https://doi.org/10.1145/3310361","url":null,"abstract":"In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques would create a large amount of long-living data objects in the heap. These generated objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence, limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca,1 a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. When systems are processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, in comparing to Spark, Deca is able to (1) reduce the garbage collection time by up to 99.9%, (2) reduce the memory consumption by up to 46.6% and the storage space by 23.4%, (3) achieve 1.2× to 22.7× speedup in terms of execution time in cases without data spilling and 16× to 41.6× speedup in cases with data spilling, and (4) provide similar performance compared to domain-specific systems.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123934765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Lock–Unlock" by R. Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, and Vasileios Trigonakis. DOI: https://doi.org/10.1145/3301501
A plethora of optimized mutex lock algorithms have been designed over the past 25 years to mitigate performance bottlenecks related to critical sections and locks. Unfortunately, there is currently no broad study of the behavior of these optimized lock algorithms on realistic applications that considers different performance metrics, such as energy efficiency and tail latency. In this article, we perform a thorough and practical analysis of synchronization, with the goal of providing software developers with enough information to design fast, scalable, and energy-efficient synchronization in their systems. First, we perform a performance study of 28 state-of-the-art mutex lock algorithms, on 40 applications, on four different multicore machines. We consider not only throughput (traditionally the main performance metric) but also energy efficiency and tail latency, which are becoming increasingly important. Second, we present an in-depth analysis in which we summarize our findings for all the studied applications. In particular, we describe nine different lock-related performance bottlenecks, and we propose six guidelines helping software developers with their choice of a lock algorithm according to the different lock properties and the application characteristics. From our detailed analysis, we make several observations regarding locking algorithms and application behaviors, several of which have not been previously discovered: (i) applications stress not only the lock–unlock interface but also the full locking API (e.g., trylocks, condition variables); (ii) the memory footprint of a lock can directly affect the application performance; (iii) for many applications, the interaction between locks and scheduling is an important application performance factor; (iv) lock tail latencies may or may not affect application tail latency; (v) no single lock is systematically the best; (vi) choosing the best lock is difficult; and (vii) energy efficiency and throughput go hand in hand in the context of lock algorithms. These findings highlight that locking involves more considerations than the simple lock/unlock interface and call for further research on designing low-memory-footprint adaptive locks that fully and efficiently support the full lock interface and consider all performance metrics.
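Finding (i), that applications stress the full locking API rather than just lock/unlock, is easy to see in pthreads terms. A minimal sketch: trylock gives an opportunistic fast path, and a condition variable ties the lock to waiting on state changes; a lock algorithm must support all of these efficiently, not only the contended lock/unlock pair.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static bool ready = false;

    void producer(void)
    {
            pthread_mutex_lock(&m);
            ready = true;
            pthread_cond_signal(&cv);        /* wake one waiter */
            pthread_mutex_unlock(&m);
    }

    bool try_consume(void)
    {
            if (pthread_mutex_trylock(&m) != 0)
                    return false;            /* contended: caller does other work */
            while (!ready)
                    pthread_cond_wait(&cv, &m); /* atomically releases m while waiting */
            ready = false;
            pthread_mutex_unlock(&m);
            return true;
    }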
{"title":"Lock–Unlock","authors":"R. Guerraoui, Hugo Guiroux, Renaud Lachaize, Vivien Quéma, Vasileios Trigonakis","doi":"10.1145/3301501","DOIUrl":"https://doi.org/10.1145/3301501","url":null,"abstract":"A plethora of optimized mutex lock algorithms have been designed over the past 25 years to mitigate performance bottlenecks related to critical sections and locks. Unfortunately, there is currently no broad study of the behavior of these optimized lock algorithms on realistic applications that consider different performance metrics, such as energy efficiency and tail latency. In this article, we perform a thorough and practical analysis of synchronization, with the goal of providing software developers with enough information to design fast, scalable, and energy-efficient synchronization in their systems. First, we perform a performance study of 28 state-of-the-art mutex lock algorithms, on 40 applications, on four different multicore machines. We consider not only throughput (traditionally the main performance metric) but also energy efficiency and tail latency, which are becoming increasingly important. Second, we present an in-depth analysis in which we summarize our findings for all the studied applications. In particular, we describe nine different lock-related performance bottlenecks, and we propose six guidelines helping software developers with their choice of a lock algorithm according to the different lock properties and the application characteristics. From our detailed analysis, we make several observations regarding locking algorithms and application behaviors, several of which have not been previously discovered: (i) applications stress not only the lock–unlock interface but also the full locking API (e.g., trylocks, condition variables); (ii) the memory footprint of a lock can directly affect the application performance; (iii) for many applications, the interaction between locks and scheduling is an important application performance factor; (vi) lock tail latencies may or may not affect application tail latency; (v) no single lock is systematically the best; (vi) choosing the best lock is difficult; and (vii) energy efficiency and throughput go hand in hand in the context of lock algorithms. These findings highlight that locking involves more considerations than the simple lock/unlock interface and call for further research on designing low-memory footprint adaptive locks that fully and efficiently support the full lock interface, and consider all performance metrics.","PeriodicalId":318554,"journal":{"name":"ACM Transactions on Computer Systems (TOCS)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121355881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}