Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Publications


SmoothOperator: Reducing Power Fragmentation and Improving Power Utilization in Large-scale Datacenters

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173190

Chang-Hong Hsu, Qingyuan Deng, Jason Mars, Lingjia Tang

With the ever-growing popularity of cloud computing and web services, Internet companies need more computing capacity to serve demand. However, power has become a major factor limiting growth in the industry: often, no more servers can be added to a datacenter without exceeding the capacity of the existing power infrastructure. In this work, we first investigate power utilization in Facebook datacenters. We observe that the combination of provisioning for peak power usage, highly fluctuating traffic, and a multi-level power delivery infrastructure leads to a significant power budget fragmentation problem and inefficiently low power utilization. To address this issue, our insight is that the heterogeneity of power consumption patterns among different services provides opportunities to reshape the power profile of each power node by redistributing services. By grouping services with asynchronous peak times under the same power node, we can reduce the peak power of each node and thus create more power headroom, allowing more servers to be hosted and achieving higher throughput. Based on this insight, we develop a workload-aware service placement framework that systematically spreads service instances with synchronous power patterns evenly across the power supply tree, greatly reducing the peak power draw at power nodes. We then leverage dynamic power profile reshaping to maximally utilize the headroom unlocked by our placement framework. Our experiments based on real production workload and power traces show that we are able to host up to 13% more machines in production without changing the underlying power infrastructure. Utilizing the unleashed power headroom with dynamic reshaping, we simultaneously achieve up to an estimated total of 15% and 11% throughput improvement for latency-critical and batch services, respectively, with up to a 44% reduction in energy slack.
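The placement insight in this abstract can be illustrated with a small sketch. The four power traces below are invented numbers, not data from the paper, and the helper names `node_peak` and `worst_node_peak` are hypothetical. The sketch only shows why pairing services whose peaks occur at different times lowers the peak draw of each power node.

```python
# Hypothetical power traces (watts) sampled at four points in time.
# "web" instances peak late, "batch" instances peak early. All numbers are made up.
web_a = [100, 100, 300, 300]
web_b = [100, 100, 300, 300]
batch_a = [300, 300, 100, 100]
batch_b = [300, 300, 100, 100]


def node_peak(traces):
    """Peak of the summed power draw of all instances under one power node."""
    return max(sum(samples) for samples in zip(*traces))


def worst_node_peak(placement):
    """Highest peak over all power nodes in a placement."""
    return max(node_peak(node) for node in placement)


# Instances with synchronous peaks share a node: the peaks add up.
synchronous = [[web_a, web_b], [batch_a, batch_b]]
# Instances with asynchronous peaks share a node: the peaks cancel out.
asynchronous = [[web_a, batch_a], [web_b, batch_b]]

sync_peak = worst_node_peak(synchronous)
async_peak = worst_node_peak(asynchronous)
print(sync_peak, async_peak)
```

Under these assumed traces, the difference between the two peaks is headroom that a node provisioned for the synchronous case could use to host additional servers.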


Citations: 36

MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173176

Hyoukjun Kwon, A. Samajdar, T. Krishna

Deep neural networks (DNNs) have demonstrated highly promising results across computer vision and speech recognition, and are becoming foundational for ubiquitous AI. The computational complexity of these algorithms and the need for high energy efficiency have led to a surge in research on hardware accelerators. To reduce the latency and energy costs of accessing DRAM, most DNN accelerators are spatial in nature, with hundreds of processing elements (PEs) operating in parallel and communicating with each other directly. DNNs are evolving at a rapid rate, and it is common to have convolution, recurrent, pooling, and fully-connected layers with varying input and filter sizes in the most recent topologies. They may be dense or sparse. They can also be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs). All of the above can lead to different dataflow patterns within the accelerator substrate. Unfortunately, most DNN accelerators support only fixed dataflow patterns internally, as they perform a careful co-design of the PEs and the network-on-chip (NoC). In fact, the majority of them are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows on the fabric efficiently, and can lead to underutilization of the available compute resources. DNN accelerators need to be programmable to enable mass deployment. For them to be programmable, they need to be configurable internally to support the various dataflow patterns that could be mapped over them. To address this need, we present MAERI, a DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring tiny switches. MAERI provides 8-459% better utilization across multiple dataflow mappings over baselines with rigid NoC fabrics.


Citations: 323

Sulong, and Thanks for All the Bugs: Finding Errors in C Programs by Abstracting from the Native Execution Model

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173174

Manuel Rigger, Roland Schatz, R. Mayrhofer, Matthias Grimmer, H. Mössenböck

In C, memory errors, such as buffer overflows, are among the most dangerous software errors; as we show, they are still on the rise. Current dynamic bug-finding tools that try to detect such errors are based on the low-level execution model of the underlying machine. They insert additional checks in an ad-hoc fashion, which makes them prone to omitting checks for corner cases. To address this, we devised a novel approach to finding bugs during the execution of a program. At the core of this approach is an interpreter written in a high-level language that performs automatic checks (such as bounds, NULL, and type checks). By mapping data structures in C to those of the high-level language, accesses are automatically checked and bugs discovered. We have implemented this approach and show that our tool (called Safe Sulong) can find bugs that state-of-the-art tools overlook, such as out-of-bounds accesses to the main function arguments.
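The idea of mapping C data onto checked objects of a high-level language can be sketched as follows. This is not Safe Sulong's implementation: Python stands in for the high-level language, and `ManagedArray` and `interpret_main` are hypothetical names. Because Python silently accepts negative list indices, the bounds check is written out explicitly here.

```python
class ManagedArray:
    """Stand-in for a C array mapped onto a managed object that knows its length."""

    def __init__(self, values):
        self._values = list(values)

    def load(self, index):
        # Explicit check: a plain Python list would accept negative indices.
        if not 0 <= index < len(self._values):
            raise IndexError(
                f"out-of-bounds read at index {index}, length {len(self._values)}"
            )
        return self._values[index]


def interpret_main(argv):
    """Toy 'C program' that reads argv[argc + 1], one slot past the terminating NULL."""
    args = ManagedArray(list(argv) + [None])  # in C, argv[argc] is a null pointer
    argc = len(argv)
    return args.load(argc + 1)


try:
    interpret_main(["prog"])
except IndexError as error:
    print("bug detected:", error)
```

Every access goes through `load`, so the check cannot be forgotten for a corner case such as the arguments of `main`.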


Citations: 13

Skyway: Connecting Managed Heaps in Distributed Big Data Systems

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173200

Khanh Nguyen, Lu Fang, Christian Navasca, G. Xu, Brian Demsky, Shan Lu

Managed languages such as Java and Scala are widely used in the development of large-scale distributed systems. Under the managed runtime, when performing data transfer across machines, a task frequently conducted in a Big Data system, the system needs to serialize a sea of objects into a byte sequence before sending them over the network. The remote node receiving the bytes then deserializes them back into objects. This process is both performance-inefficient and labor-intensive: (1) object serialization/deserialization makes heavy use of reflection, an expensive runtime operation, and/or (2) serialization/deserialization functions need to be hand-written and are error-prone. This paper presents Skyway, a JVM-based technique that can directly connect managed heaps of different (local or remote) JVM processes. Under Skyway, objects in the source heap can be directly written into a remote heap without changing their formats. Skyway provides performance benefits to any JVM-based system by completely eliminating the need (1) to invoke serialization/deserialization functions, thus saving CPU time, and (2) for developers to hand-write serialization functions.


Citations: 39

Automatic Hierarchical Parallelization of Linear Recurrences

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173168

Sepideh Maleki, Martin Burtscher

Linear recurrences encompass many fundamental computations including prefix sums and digital filters. Later result values depend on earlier result values in recurrences, making it a challenge to compute them in parallel. We present a new work- and space-efficient algorithm to compute linear recurrences that is amenable to automatic parallelization and suitable for hierarchical massively-parallel architectures such as GPUs. We implemented our approach in a domain-specific code generator that emits optimized CUDA code. Our evaluation shows that, for standard prefix sums and single-stage IIR filters, the generated code reaches the throughput of memory copy for large inputs, which cannot be surpassed. On higher-order prefix sums, it performs nearly as well as the fastest handwritten code from the literature. On tuple-based prefix sums and digital filters, our automatically parallelized code outperforms the fastest prior implementations.
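To make the dependence problem concrete, consider the first-order recurrence y[i] = a * y[i-1] + x[i] (a prefix sum when a = 1). The sketch below is not the paper's algorithm or its generated CUDA code. It shows the classic observation that each step is an affine map and that composing affine maps is associative, which permits a chunked, two-level evaluation. The function names and the chunk size are assumptions made for illustration, and the loops marked "independent" are the ones a parallel implementation could run concurrently.

```python
def sequential(a, xs, y0=0):
    """Reference: y[i] = a * y[i-1] + x[i], computed strictly in order."""
    ys, y = [], y0
    for x in xs:
        y = a * y + x
        ys.append(y)
    return ys


def combine(f, g):
    """Compose affine maps (m, c), meaning y -> m*y + c: apply f first, then g."""
    return (g[0] * f[0], g[0] * f[1] + g[1])


def chunked(a, xs, chunk, y0=0):
    chunks = [xs[i:i + chunk] for i in range(0, len(xs), chunk)]

    # Phase 1 (independent per chunk): summarize each chunk as one affine map.
    summaries = []
    for block in chunks:
        summary = (1, 0)  # identity map
        for x in block:
            summary = combine(summary, (a, x))
        summaries.append(summary)

    # Phase 2 (short serial scan over the summaries): carry-in value of each chunk.
    carries, y = [], y0
    for m, c in summaries:
        carries.append(y)
        y = m * y + c

    # Phase 3 (independent per chunk): replay each chunk from its carry-in.
    out = []
    for block, carry in zip(chunks, carries):
        out.extend(sequential(a, block, carry))
    return out


xs = list(range(1, 8))
print(sequential(2, xs) == chunked(2, xs, chunk=3))
```

Only phase 2 is serial, and it touches one summary per chunk rather than one value per element, which is what makes a hierarchical mapping onto blocks and threads plausible.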


Citations: 7

Session details: Session 8B: Potpourri

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3252967

Yan Solihin

Citations: 0

FirmUp: Precise Static Detection of Common Vulnerabilities in Firmware

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3177157

Yaniv David, Nimrod Partush, Eran Yahav

We present a static, precise, and scalable technique for finding CVEs (Common Vulnerabilities and Exposures) in stripped firmware images. Our technique is able to efficiently find vulnerabilities in real-world firmware with high accuracy. Given a vulnerable procedure in an executable binary and a firmware image containing multiple stripped binaries, our goal is to detect possible occurrences of the vulnerable procedure in the firmware image. Due to the variety of architectures and unique tool chains used by vendors, as well as the highly customized nature of firmware, identifying procedures in stripped firmware is extremely challenging. Vulnerability detection requires not only pairwise similarity between procedures but also information about the relationships between procedures in the surrounding executable. This observation serves as the foundation for a novel technique that establishes a partial correspondence between procedures in the two binaries. We implemented our technique in a tool called FirmUp and performed an extensive evaluation over 40 million procedures, over 4 different prevalent architectures, crawled from public vendor firmware images. We discovered 373 vulnerabilities affecting publicly available firmware, 147 of them in the latest available firmware version for the device. A thorough comparison of FirmUp to previous methods shows that it accurately and effectively finds vulnerabilities in firmware, while outperforming the detection rate of the state of the art by 45% on average.


Citations: 77

FCatch: Automatically Detecting Time-of-fault Bugs in Cloud Systems

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3177161

Haopeng Liu, Xu Wang, Guangpu Li, Shan Lu, Feng Ye, Chen Tian

It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging given how common component failures (i.e., faults) are. Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and so introduce time-of-fault (TOF) bugs that manifest only when a node crashes or a message drops at a special moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper carefully models TOF bugs as a new type of concurrency bug, and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.


Citations: 26

Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3173187

Zhibin Yu, Zhendong Bei, Xuehai Qian

In-Memory cluster Computing (IMC) frameworks (e.g., Spark) have become increasingly important because they typically achieve more than 10× speedups over traditional On-Disk cluster Computing (ODC) frameworks for iterative and interactive applications. Like ODC, IMC frameworks typically run the same given programs repeatedly on a given cluster with a similar input dataset size each time. It is challenging to build a performance model for an IMC program because: 1) the performance of IMC programs is more sensitive to the size of the input dataset, which is known to be difficult to incorporate into a performance model due to its complex effects on performance; 2) the number of performance-critical configuration parameters in IMC is much larger than in ODC (more than 40 vs. around 10), and the high dimensionality requires more sophisticated models to achieve high accuracy. To address this challenge, we propose DAC, a datasize-aware auto-tuning approach that efficiently identifies the high-dimensional configuration with which a given IMC program achieves optimal performance on a given cluster. DAC is a significant advance over the state of the art because it can take the size of the input dataset and 41 configuration parameters as the parameters of the performance model for a given IMC program, which is unprecedented in previous work. It is made possible by two key techniques: 1) Hierarchical Modeling (HM), which combines a number of individual sub-models in a hierarchical manner; 2) a Genetic Algorithm (GA), which is employed to search for the optimal configuration. To evaluate DAC, we use six typical Spark programs, each with five different input dataset sizes. The evaluation results show that, compared to default configurations, DAC improves the performance of these programs by a factor of 30.4x on average and up to 89x. We also report that the geometric mean speedups of DAC over the default, expert, and RFHOC configurations are 15.4x, 2.3x, and 1.5x, respectively.
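The search step named in this abstract, a genetic algorithm over configurations scored by a performance model, can be sketched as below. Everything specific here is an assumption: the three parameters in `SPACE` are illustrative rather than real Spark settings, `predicted_runtime` is an invented formula standing in for DAC's hierarchical model, and the GA settings (population size, mutation rate, elitism) are arbitrary choices.

```python
import random

# Hypothetical tunable parameters and their allowed values (not Spark's real ones).
SPACE = {
    "executor_cores": [1, 2, 4, 8],
    "memory_gb": [2, 4, 8, 16],
    "partitions": [50, 100, 200, 400],
}


def predicted_runtime(cfg, datasize_gb):
    """Stand-in for a learned, datasize-aware performance model (formula is invented)."""
    work = datasize_gb * 100.0 / cfg["executor_cores"]
    spill = max(0.0, datasize_gb - cfg["memory_gb"]) * 20.0
    overhead = abs(cfg["partitions"] - 10 * datasize_gb) * 0.1
    return work + spill + overhead


def random_config(rng):
    return {name: rng.choice(values) for name, values in SPACE.items()}


def crossover(p, q, rng):
    """Take each parameter from one of the two parents at random."""
    return {name: rng.choice([p[name], q[name]]) for name in SPACE}


def mutate(cfg, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    return {
        name: (rng.choice(SPACE[name]) if rng.random() < rate else value)
        for name, value in cfg.items()
    }


def genetic_search(datasize_gb, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)

    def cost(cfg):
        return predicted_runtime(cfg, datasize_gb)

    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        elite = population[: pop_size // 2]  # keep the better half unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return min(population, key=cost)


best = genetic_search(datasize_gb=10)
print(best, predicted_runtime(best, 10))
```

Because the input dataset size is an argument of the model, rerunning `genetic_search` with a different `datasize_gb` can yield a different configuration, which is the datasize-aware aspect the abstract emphasizes.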


Citations: 63

Frightening Small Children and Disconcerting Grown-ups: Concurrency in the Linux Kernel

Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems

Pub Date : 2018-03-19 DOI: 10.1145/3173162.3177156

J. Alglave, Luc Maranget, P. McKenney, A. Parri, A. S. Stern

Concurrency in the Linux kernel can be a contentious topic. The Linux kernel mailing list features numerous discussions related to consistency models, including those of the more than 30 CPU architectures supported by the kernel and that of the kernel itself. How are Linux programs supposed to behave? Do they behave correctly on exotic hardware? A formal model can help address such questions. Better yet, an executable model allows programmers to experiment with the model to develop their intuition. Thus we offer a model written in the cat language, making it not only formal, but also executable by the herd simulator. We tested our model against hardware and refined it in consultation with maintainers. Finally, we formalised the fundamental law of the Read-Copy-Update synchronisation mechanism, and proved that one of its implementations satisfies this law.


Citations: 43
