TCP congestion control algorithms implicitly assume that per-flow throughput is at least a few packets per round-trip time. Environments where this assumption does not hold, which we refer to as small packet regimes, are common in wired and cellular networks in developing regions. In this paper we show that in small packet regimes TCP flows experience severe unfairness, high packet loss rates, and flow silences due to repetitive timeouts. We propose an approximate Markov model of TCP behavior in small packet regimes that characterizes the TCP breakdown region leading to repetitive timeouts. To enhance TCP performance in such regimes, we propose Timeout Aware Queuing (TAQ), a readily deployable in-network middlebox approach that uses a multi-level adaptive priority queuing algorithm to reduce the probability of timeouts and improve fairness and performance predictability. We demonstrate the effectiveness of TAQ across a spectrum of small packet regime network conditions using simulations, a prototype implementation, and testbed experiments.
{"title":"TAQ: enhancing fairness and performance predictability in small packet regimes","authors":"Jay Chen, L. Subramanian, J. Iyengar, B. Ford","doi":"10.1145/2592798.2592819","DOIUrl":"https://doi.org/10.1145/2592798.2592819","url":null,"abstract":"TCP congestion control algorithms implicitly assume that the per-flow throughput is at least a few packets per round trip time. Environments where this assumption does not hold, which we refer to as small packet regimes, are common in the contexts of wired and cellular networks in developing regions. In this paper we show that in small packet regimes TCP flows experience severe unfairness, high packet loss rates, and flow silences due to repetitive timeouts. We propose an approximate Markov model to describe TCP behavior in small packet regimes to characterize the TCP breakdown region that leads to repetitive timeout behavior. To enhance TCP performance in such regimes, we propose Timeout Aware Queuing (TAQ), a readily deployable in-network middlebox approach that uses a multi-level adaptive priority queuing algorithm to reduce the probability of timeouts, improve fairness and performance predictability. We demonstrate the effectiveness of TAQ across a spectrum of small packet regime network conditions using simulations, a prototype implementation, and testbed experiments.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"40 1","pages":"7:1-7:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90117715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since NAND flash cannot be updated in place, SSDs must perform all writes to pre-erased pages. Consequently, pages containing superseded data must be invalidated and garbage collected. Garbage collection adds significant cost in the form of the extra writes needed to relocate valid pages from erasure candidates to clean blocks, causing the well-known write amplification problem. SSDs reserve a certain amount of flash space that is invisible to users, called over-provisioning space, to alleviate write amplification. However, NAND blocks can sustain only a limited number of program/erase (P/E) cycles. As blocks are retired for exceeding this limit, the shrinking over-provisioning pool degrades SSD performance. In this work, we propose a novel system design, the Smart Retirement FTL (SR-FTL), that reuses flash blocks which have been cycled to their maximum specified P/E endurance. We take advantage of the fact that the specified P/E limit guarantees a data retention time of at least one year, while most active data becomes stale in a much shorter period, as observed in a variety of disk workloads. Our approach aggressively manages worn blocks to store data that requires only short retention times, while carefully preserving data reliability on those blocks. We evaluate SR-FTL both in simulation on an SSD simulator and in a prototype implementation on an OpenSSD platform. Experimental results show that SR-FTL maintains consistent over-provisioning space levels as blocks wear, and thus limits SSD performance degradation near end-of-life. In addition, we show that our scheme reduces block wear near end-of-life by as much as 84% in some scenarios.
{"title":"An aggressive worn-out flash block management scheme to alleviate SSD performance degradation","authors":"Ping Huang, Guanying Wu, Xubin He, Weijun Xiao","doi":"10.1145/2592798.2592818","DOIUrl":"https://doi.org/10.1145/2592798.2592818","url":null,"abstract":"Since NAND flash cannot be updated in place, SSDs must perform all writes in pre-erased pages. Consequently, pages containing superseded data must be invalidated and garbage collected. This garbage collection adds significant cost in terms of the extra writes necessary to relocate valid pages from erasure candidates to clean blocks, causing the well-known write amplification problem. SSDs reserve a certain amount of flash space which is invisible to users, called over-provisioning space, to alleviate the write amplification problem. However, NAND blocks can support only a limited number of program/erase cycles. As blocks are retired due to exceeding the limit, the reduced size of the over-provisioning pool leads to degraded SSD performance.\u0000 In this work, we propose a novel system design that we call the Smart Retirement FTL (SR-FTL) to reuse the flash blocks which have been cycled to the maximum specified P/E endurance. We take advantage of the fact that the specified P/E limit guarantees data retention time of at least one year while most active data becomes stale in a period much shorter than one year, as observed in a variety of disk workloads. Our approach aggressively manages worn blocks to store data that requires only short retention time. In the meantime, the data reliability on worn blocks is carefully guaranteed. We evaluate the SR-FTL by both simulation on an SSD simulator and prototype implementation on an OpenSSD platform. Experimental results show that the SR-FTL successfully maintains consistent over-provisioning space levels as blocks wear and thus the degree of SSD performance degradation near end-of-life. In addition, we show that our scheme reduces block wear near end-of-life by as much as 84% in some scenarios.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"9 1","pages":"22:1-22:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88442076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern storage systems typically combine plain replication and erasure codes to reliably store large amounts of data in datacenters. Plain replication allows fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are increasingly employed in real systems, they incur high overhead during maintenance: upon a failure, files typically must be decoded and re-encoded to repair the encoded blocks stored at the faulty node. In this paper, we propose a novel erasure-code system tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce the amount of data transferred during maintenance by 50%, by repairing several cluster files simultaneously. We demonstrate, through both analysis and an extensive experimental study conducted on a public testbed, that our approach significantly decreases both the bandwidth overhead of the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact throughput, and comes only at the price of higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness by determining whether the cluster's network bandwidth dedicated to repair or the CPU dedicated to decoding saturates first.
{"title":"Archiving cold data in warehouses with clustered network coding","authors":"Fabien André, Anne-Marie Kermarrec, E. L. Merrer, Nicolas Le Scouarnec, G. Straub, Alexandre van Kempen","doi":"10.1145/2592798.2592816","DOIUrl":"https://doi.org/10.1145/2592798.2592816","url":null,"abstract":"Modern storage systems now typically combine plain replication and erasure codes to reliably store large amount of data in datacenters. Plain replication allows a fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are now increasingly employed in real systems, they experience high overhead during maintenance, i.e., upon failures, typically requiring files to be decoded before being encoded again to repair the encoded blocks stored at the faulty node. In this paper, we propose a novel erasure code system, tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce by 50% the amount of data transferred during maintenance, by repairing several cluster files simultaneously. We demonstrate both through an analysis and extensive experimental study conducted on a public testbed that our approach significantly decreases both the bandwidth overhead during the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact the throughput, and comes only at the price of a higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness by determining whether the cluster's network bandwidth dedicated to repair or CPU dedicated to decoding saturates first.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"11 1","pages":"21:1-21:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88614750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The simplest strategy to guarantee good quality of service (QoS) for a latency-sensitive workload with sub-millisecond latency in a shared cluster environment is to never run other workloads concurrently with it on the same server. Unfortunately, this inevitably leads to low server utilization, reducing both the capability and the cost effectiveness of the cluster. In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads. We show that workload co-location leads to QoS violations due to increases in queuing delay, scheduling delay, and thread load imbalance. We present techniques that address these vulnerabilities, ranging from provisioning the latency-critical service in an interference-aware manner to replacing the Linux CFS scheduler with one that provides good latency guarantees and fairness for co-located workloads. Ultimately, we demonstrate that some latency-critical workloads can be aggressively co-located with other workloads while achieving good QoS, and that such co-location can improve a datacenter's effective throughput per TCO-$ by up to 52%.
{"title":"Reconciling high server utilization and sub-millisecond quality-of-service","authors":"J. Leverich, C. Kozyrakis","doi":"10.1145/2592798.2592821","DOIUrl":"https://doi.org/10.1145/2592798.2592821","url":null,"abstract":"The simplest strategy to guarantee good quality of service (QoS) for a latency-sensitive workload with sub-millisecond latency in a shared cluster environment is to never run other workloads concurrently with it on the same server. Unfortunately, this inevitably leads to low server utilization, reducing both the capability and cost effectiveness of the cluster.\u0000 In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads. We show that workload co-location leads to QoS violations due to increases in queuing delay, scheduling delay, and thread load imbalance. We present techniques that address these vulnerabilities, ranging from provisioning the latency-critical service in an interference aware manner, to replacing the Linux CFS scheduler with a scheduler that provides good latency guarantees and fairness for co-located workloads. Ultimately, we demonstrate that some latency-critical workloads can be aggressively co-located with other workloads, achieve good QoS, and that such co-location can improve a datacenter's effective throughput per TCO-$ by up to 52%.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"57 1","pages":"4:1-4:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90870462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
State-of-the-art kernel diagnostic tools such as DTrace and SystemTap provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics. This work contributes a method and an implementation for mapping a kernel's data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain-specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel's data structures. PiCO QL queries are interactive and type safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it requires no kernel instrumentation; instead, it hooks into existing kernel data structures through the module's source code. PiCO QL imposes no overhead when idle and needs access only to the kernel data structures that hold information relevant to the input queries. We demonstrate PiCO QL's usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues such as security vulnerabilities and performance problems.
{"title":"Relational access to Unix kernel data structures","authors":"Marios Fragkoulis, D. Spinellis, P. Louridas, A. Bilas","doi":"10.1145/2592798.2592802","DOIUrl":"https://doi.org/10.1145/2592798.2592802","url":null,"abstract":"State of the art kernel diagnostic tools like DTrace and Systemtap provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics.\u0000 This work contributes a method and an implementation for mapping a kernel's data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel's data structures. PiCO QL queries are interactive and type safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it does not require kernel instrumentation; instead it hooks to existing kernel data structures through the module's source code. PiCO QL imposes no overhead when idle and needs only access to the kernel data structures that contain relevant information for answering the input queries.\u0000 We demonstrate PiCO QL's usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues, such as security vulnerabilities and performance problems.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"2016 1","pages":"12:1-12:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86502461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store-accessible Persistent Memory (PM) has implications for system design, in both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a lightweight POSIX file system that exploits PM's byte-addressability to avoid the overheads of block-oriented storage and to enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory-ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that gives software enforceable guarantees of the durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability. Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant gains (up to an order of magnitude) over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.
{"title":"System software for persistent memory","authors":"Subramanya R. Dulloor, Sanjay Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, Jeffrey R. Jackson","doi":"10.1145/2592798.2592814","DOIUrl":"https://doi.org/10.1145/2592798.2592814","url":null,"abstract":"Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store accessible Persistent Memory (PM) has implications on system design, both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability.\u0000 Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"54 1","pages":"15:1-15:15"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88969626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal graphs capture changes in graphs over time and attract increasing interest from the research community, for example, for understanding the temporal characteristics of social interactions on a time-evolving social graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design: the in-memory layout of temporal graphs and the scheduling of iterative computation on them are carefully designed so that common "bulk" operations on temporal graphs maximize the benefit of in-memory data locality. The design of Chronos further explores the interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs. The result is a high-performance temporal-graph system that offers up to an order of magnitude speedup for temporal iterative graph mining compared to a straightforward application of existing graph engines to a series of snapshots.
{"title":"Chronos: a graph engine for temporal graph analysis","authors":"Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Wenguang Chen, Enhong Chen","doi":"10.1145/2592798.2592799","DOIUrl":"https://doi.org/10.1145/2592798.2592799","url":null,"abstract":"Temporal graphs capture changes in graphs over time and are becoming a subject that attracts increasing interest from the research communities, for example, to understand temporal characteristics of social interactions on a time-evolving social graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design, where the in-memory layout of temporal graphs and the scheduling of the iterative computation on temporal graphs are carefully designed, so that common \"bulk\" operations on temporal graphs are scheduled to maximize the benefit of in-memory data locality. The design of Chronos further explores the interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs. The result is a high-performance temporal-graph system that offers up to an order of magnitude speedup for temporal iterative graph mining compared to a straightforward application of existing graph engines on a series of snapshots.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"5 1","pages":"1:1-1:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82128248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transactional memory (TM) has reached maturity, and programmers have started using this programming model to parallelize their applications. However, although much effort has been put into the development of TM systems, there is still a lack of debugging and development tools for TM applications, such as race detectors. Previous definitions of transactional data race often impose constraints on the TM implementation or the programming language and cannot be widely applied to current STM designs. We propose a new definition of transactional data race that follows the programmer's intuition of racy accesses, is independent of thread interleaving, can accommodate popular STM systems, and allows common programming idioms. Based on this definition, we design and implement T-Rex, a precise dynamic race detection tool for C/C++ TM programs. Using T-Rex we discover transactional data races in STAMP applications that, to the best of our knowledge, have not been previously reported. Our experiments also show that T-Rex's runtime overhead is comparable to that of state-of-the-art lock-based race detection tools, despite the extra work required to handle transactional memory semantics.
{"title":"T-Rex: a dynamic race detection tool for C/C++ transactional memory applications","authors":"Gokcen Kestor, O. Unsal, A. Cristal, S. Tasiran","doi":"10.1145/2592798.2592809","DOIUrl":"https://doi.org/10.1145/2592798.2592809","url":null,"abstract":"Transactional memory (TM) has reached a maturity level and programmers have started using this programming model to parallelize their applications. However, although much effort has been put into the development of TM systems, there is still lack of debugging and development tools for TM applications, such as race detection tools.\u0000 Previous definitions of transactional data race often impose constraints on the TM implementation or the programming language and cannot be widely applied to current STM designs. We propose a new definition of transactional data race that follows the programmer's intuition of racy accesses, is independent of thread interleaving, can accommodate popular STM systems, and allows common programming idioms.\u0000 Based on this definition, we design and implement T-Rex, a precise dynamic race detection tool for C/C++ TM programs. Using T-Rex we discover transactional data races in STAMP applications that, to the best of our knowledge, have not been previously reported. Our experiments also show that T-Rex runtime overhead is comparable to state-of-the-art lock-based race detection tools, despite the extra work required to handle transactional memory semantics.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"28 1","pages":"20:1-20:12"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74859811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically and atomically. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated techniques.
{"title":"StackTrack: an automated transactional approach to concurrent memory reclamation","authors":"Dan Alistarh, P. Eugster, M. Herlihy, A. Matveev, N. Shavit","doi":"10.1145/2592798.2592808","DOIUrl":"https://doi.org/10.1145/2592798.2592808","url":null,"abstract":"Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic fashion. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated, techniques.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"1 1","pages":"25:1-25:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80468093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology; these assumptions are invalid in practice and can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same-size flows. We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols, and WCMP can be readily implemented in current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment, and present experimental and simulation results showing that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.
{"title":"WCMP: weighted cost multipathing for improved fairness in data centers","authors":"Junlan Zhou, Malveeka Tewari, Min Zhu, A. Kabbani, L. Poutievski, Arjun Singh, Amin Vahdat","doi":"10.1145/2592798.2592803","DOIUrl":"https://doi.org/10.1145/2592798.2592803","url":null,"abstract":"Data Center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and the most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology, which are invalid assumptions in practice that can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same size flows.\u0000 We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols and it can be readily implemented in the current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment and present experimental and simulation results to show that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"84 1","pages":"5:1-5:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83340173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}