TCP congestion control algorithms implicitly assume that per-flow throughput is at least a few packets per round-trip time. Environments where this assumption does not hold, which we refer to as small packet regimes, are common in wired and cellular networks in developing regions. In this paper we show that in small packet regimes TCP flows experience severe unfairness, high packet loss rates, and flow silences due to repetitive timeouts. We propose an approximate Markov model of TCP behavior in small packet regimes that characterizes the TCP breakdown region leading to repetitive timeouts. To enhance TCP performance in such regimes, we propose Timeout Aware Queuing (TAQ), a readily deployable in-network middlebox approach that uses a multi-level adaptive priority queuing algorithm to reduce the probability of timeouts and improve fairness and performance predictability. We demonstrate the effectiveness of TAQ across a spectrum of small packet regime network conditions using simulations, a prototype implementation, and testbed experiments.
{"title":"TAQ: enhancing fairness and performance predictability in small packet regimes","authors":"Jay Chen, L. Subramanian, J. Iyengar, B. Ford","doi":"10.1145/2592798.2592819","DOIUrl":"https://doi.org/10.1145/2592798.2592819","url":null,"abstract":"TCP congestion control algorithms implicitly assume that the per-flow throughput is at least a few packets per round trip time. Environments where this assumption does not hold, which we refer to as small packet regimes, are common in the contexts of wired and cellular networks in developing regions. In this paper we show that in small packet regimes TCP flows experience severe unfairness, high packet loss rates, and flow silences due to repetitive timeouts. We propose an approximate Markov model to describe TCP behavior in small packet regimes to characterize the TCP breakdown region that leads to repetitive timeout behavior. To enhance TCP performance in such regimes, we propose Timeout Aware Queuing (TAQ), a readily deployable in-network middlebox approach that uses a multi-level adaptive priority queuing algorithm to reduce the probability of timeouts, improve fairness and performance predictability. We demonstrate the effectiveness of TAQ across a spectrum of small packet regime network conditions using simulations, a prototype implementation, and testbed experiments.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"40 1","pages":"7:1-7:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90117715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Since NAND flash cannot be updated in place, SSDs must perform all writes to pre-erased pages. Consequently, pages containing superseded data must be invalidated and garbage collected. Garbage collection adds significant cost in the form of the extra writes needed to relocate valid pages from erasure candidates to clean blocks, causing the well-known write amplification problem. SSDs reserve a certain amount of flash space that is invisible to users, called over-provisioning space, to alleviate write amplification. However, NAND blocks can sustain only a limited number of program/erase (P/E) cycles. As blocks are retired for exceeding this limit, the shrinking over-provisioning pool degrades SSD performance. In this work, we propose a novel system design, the Smart Retirement FTL (SR-FTL), that reuses flash blocks which have been cycled to their maximum specified P/E endurance. We take advantage of the fact that the specified P/E limit guarantees a data retention time of at least one year, while most active data becomes stale in a much shorter period, as observed in a variety of disk workloads. Our approach aggressively manages worn blocks to store data that requires only short retention times, while carefully preserving data reliability on those blocks. We evaluate SR-FTL both in simulation on an SSD simulator and in a prototype implementation on an OpenSSD platform. Experimental results show that SR-FTL maintains consistent over-provisioning space levels as blocks wear, and thus limits SSD performance degradation near end-of-life. In addition, we show that our scheme reduces block wear near end-of-life by as much as 84% in some scenarios.
{"title":"An aggressive worn-out flash block management scheme to alleviate SSD performance degradation","authors":"Ping Huang, Guanying Wu, Xubin He, Weijun Xiao","doi":"10.1145/2592798.2592818","DOIUrl":"https://doi.org/10.1145/2592798.2592818","url":null,"abstract":"Since NAND flash cannot be updated in place, SSDs must perform all writes in pre-erased pages. Consequently, pages containing superseded data must be invalidated and garbage collected. This garbage collection adds significant cost in terms of the extra writes necessary to relocate valid pages from erasure candidates to clean blocks, causing the well-known write amplification problem. SSDs reserve a certain amount of flash space which is invisible to users, called over-provisioning space, to alleviate the write amplification problem. However, NAND blocks can support only a limited number of program/erase cycles. As blocks are retired due to exceeding the limit, the reduced size of the over-provisioning pool leads to degraded SSD performance.\u0000 In this work, we propose a novel system design that we call the Smart Retirement FTL (SR-FTL) to reuse the flash blocks which have been cycled to the maximum specified P/E endurance. We take advantage of the fact that the specified P/E limit guarantees data retention time of at least one year while most active data becomes stale in a period much shorter than one year, as observed in a variety of disk workloads. Our approach aggressively manages worn blocks to store data that requires only short retention time. In the meantime, the data reliability on worn blocks is carefully guaranteed. We evaluate the SR-FTL by both simulation on an SSD simulator and prototype implementation on an OpenSSD platform. Experimental results show that the SR-FTL successfully maintains consistent over-provisioning space levels as blocks wear and thus the degree of SSD performance degradation near end-of-life. In addition, we show that our scheme reduces block wear near end-of-life by as much as 84% in some scenarios.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"9 1","pages":"22:1-22:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88442076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern storage systems typically combine plain replication and erasure codes to reliably store large amounts of data in datacenters. Plain replication allows fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are increasingly employed in real systems, they incur high overhead during maintenance: upon a failure, files typically must be decoded and re-encoded to repair the encoded blocks stored at the faulty node. In this paper, we propose a novel erasure-code system tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce the amount of data transferred during maintenance by 50%, by repairing several cluster files simultaneously. We demonstrate, through both analysis and an extensive experimental study conducted on a public testbed, that our approach significantly decreases both the bandwidth overhead of the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact throughput, and comes only at the price of higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness by determining whether the cluster's network bandwidth dedicated to repair or the CPU dedicated to decoding saturates first.
{"title":"Archiving cold data in warehouses with clustered network coding","authors":"Fabien André, Anne-Marie Kermarrec, E. L. Merrer, Nicolas Le Scouarnec, G. Straub, Alexandre van Kempen","doi":"10.1145/2592798.2592816","DOIUrl":"https://doi.org/10.1145/2592798.2592816","url":null,"abstract":"Modern storage systems now typically combine plain replication and erasure codes to reliably store large amount of data in datacenters. Plain replication allows a fast access to popular data, while erasure codes, e.g., Reed-Solomon codes, provide a storage-efficient alternative for archiving less popular data. Although erasure codes are now increasingly employed in real systems, they experience high overhead during maintenance, i.e., upon failures, typically requiring files to be decoded before being encoded again to repair the encoded blocks stored at the faulty node. In this paper, we propose a novel erasure code system, tailored for networked archival systems. The efficiency of our approach relies on the joint use of random codes and a clustered placement strategy. Our repair protocol leverages network coding techniques to reduce by 50% the amount of data transferred during maintenance, by repairing several cluster files simultaneously. We demonstrate both through an analysis and extensive experimental study conducted on a public testbed that our approach significantly decreases both the bandwidth overhead during the maintenance process and the time to repair lost data. We also show that using a non-systematic code does not impact the throughput, and comes only at the price of a higher CPU usage. Based on these results, we evaluate the impact of this higher CPU consumption on different configurations of data coldness by determining whether the cluster's network bandwidth dedicated to repair or CPU dedicated to decoding saturates first.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"11 1","pages":"21:1-21:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88614750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The simplest strategy to guarantee good quality of service (QoS) for a latency-sensitive workload with sub-millisecond latency in a shared cluster environment is to never run other workloads concurrently with it on the same server. Unfortunately, this inevitably leads to low server utilization, reducing both the capability and the cost effectiveness of the cluster. In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads. We show that workload co-location leads to QoS violations due to increases in queuing delay, scheduling delay, and thread load imbalance. We present techniques that address these vulnerabilities, ranging from provisioning the latency-critical service in an interference-aware manner to replacing the Linux CFS scheduler with one that provides good latency guarantees and fairness for co-located workloads. Ultimately, we demonstrate that some latency-critical workloads can be aggressively co-located with other workloads while achieving good QoS, and that such co-location can improve a datacenter's effective throughput per TCO-$ by up to 52%.
{"title":"Reconciling high server utilization and sub-millisecond quality-of-service","authors":"J. Leverich, C. Kozyrakis","doi":"10.1145/2592798.2592821","DOIUrl":"https://doi.org/10.1145/2592798.2592821","url":null,"abstract":"The simplest strategy to guarantee good quality of service (QoS) for a latency-sensitive workload with sub-millisecond latency in a shared cluster environment is to never run other workloads concurrently with it on the same server. Unfortunately, this inevitably leads to low server utilization, reducing both the capability and cost effectiveness of the cluster.\u0000 In this paper, we analyze the challenges of maintaining high QoS for low-latency workloads when sharing servers with other workloads. We show that workload co-location leads to QoS violations due to increases in queuing delay, scheduling delay, and thread load imbalance. We present techniques that address these vulnerabilities, ranging from provisioning the latency-critical service in an interference aware manner, to replacing the Linux CFS scheduler with a scheduler that provides good latency guarantees and fairness for co-located workloads. Ultimately, we demonstrate that some latency-critical workloads can be aggressively co-located with other workloads, achieve good QoS, and that such co-location can improve a datacenter's effective throughput per TCO-$ by up to 52%.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"57 1","pages":"4:1-4:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90870462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
State-of-the-art kernel diagnostic tools such as DTrace and SystemTap provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics. This work contributes a method and an implementation for mapping a kernel's data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain-specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel's data structures. PiCO QL queries are interactive and type safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it requires no kernel instrumentation; instead, it hooks into existing kernel data structures through the module's source code. PiCO QL imposes no overhead when idle and needs access only to the kernel data structures that hold information relevant to the input queries. We demonstrate PiCO QL's usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues such as security vulnerabilities and performance problems.
{"title":"Relational access to Unix kernel data structures","authors":"Marios Fragkoulis, D. Spinellis, P. Louridas, A. Bilas","doi":"10.1145/2592798.2592802","DOIUrl":"https://doi.org/10.1145/2592798.2592802","url":null,"abstract":"State of the art kernel diagnostic tools like DTrace and Systemtap provide a procedural interface for expressing analysis tasks. We argue that a relational interface to kernel data structures can offer complementary benefits for kernel diagnostics.\u0000 This work contributes a method and an implementation for mapping a kernel's data structures to a relational interface. The Pico COllections Query Library (PiCO QL) Linux kernel module uses a domain specific language to define a relational representation of accessible Linux kernel data structures, a parser to analyze the definitions, and a compiler to implement an SQL interface to the data structures. It then evaluates queries written in SQL against the kernel's data structures. PiCO QL queries are interactive and type safe. Unlike SystemTap and DTrace, PiCO QL is less intrusive because it does not require kernel instrumentation; instead it hooks to existing kernel data structures through the module's source code. PiCO QL imposes no overhead when idle and needs only access to the kernel data structures that contain relevant information for answering the input queries.\u0000 We demonstrate PiCO QL's usefulness by presenting Linux kernel queries that provide meaningful custom views of system resources and pinpoint issues, such as security vulnerabilities and performance problems.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"2016 1","pages":"12:1-12:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86502461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store-accessible Persistent Memory (PM) has implications for system design, in both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a lightweight POSIX file system that exploits PM's byte-addressability to avoid the overheads of block-oriented storage and to enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory-ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that gives software enforceable guarantees of the durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability. Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant gains (up to an order of magnitude) over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.
{"title":"System software for persistent memory","authors":"Subramanya R. Dulloor, Sanjay Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, Jeffrey R. Jackson","doi":"10.1145/2592798.2592814","DOIUrl":"https://doi.org/10.1145/2592798.2592814","url":null,"abstract":"Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store accessible Persistent Memory (PM) has implications on system design, both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability.\u0000 Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"54 1","pages":"15:1-15:15"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88969626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal graphs capture changes in graphs over time and attract increasing interest from the research community, for example, for understanding the temporal characteristics of social interactions on a time-evolving social graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design: the in-memory layout of temporal graphs and the scheduling of iterative computation on them are carefully designed so that common "bulk" operations on temporal graphs maximize the benefit of in-memory data locality. The design of Chronos further explores the interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs. The result is a high-performance temporal-graph system that offers up to an order of magnitude speedup for temporal iterative graph mining compared to a straightforward application of existing graph engines to a series of snapshots.
{"title":"Chronos: a graph engine for temporal graph analysis","authors":"Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Wenguang Chen, Enhong Chen","doi":"10.1145/2592798.2592799","DOIUrl":"https://doi.org/10.1145/2592798.2592799","url":null,"abstract":"Temporal graphs capture changes in graphs over time and are becoming a subject that attracts increasing interest from the research communities, for example, to understand temporal characteristics of social interactions on a time-evolving social graph. Chronos is a storage and execution engine designed and optimized specifically for running in-memory iterative graph computation on temporal graphs. Locality is at the center of the Chronos design, where the in-memory layout of temporal graphs and the scheduling of the iterative computation on temporal graphs are carefully designed, so that common \"bulk\" operations on temporal graphs are scheduled to maximize the benefit of in-memory data locality. The design of Chronos further explores the interesting interplay among locality, parallelism, and incremental computation in supporting common mining tasks on temporal graphs. The result is a high-performance temporal-graph system that offers up to an order of magnitude speedup for temporal iterative graph mining compared to a straightforward application of existing graph engines on a series of snapshots.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"5 1","pages":"1:1-1:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82128248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transactional memory (TM) has reached maturity, and programmers have started using this programming model to parallelize their applications. However, although much effort has been put into the development of TM systems, there is still a lack of debugging and development tools for TM applications, such as race detectors. Previous definitions of transactional data race often impose constraints on the TM implementation or the programming language and cannot be widely applied to current STM designs. We propose a new definition of transactional data race that follows the programmer's intuition of racy accesses, is independent of thread interleaving, can accommodate popular STM systems, and allows common programming idioms. Based on this definition, we design and implement T-Rex, a precise dynamic race detection tool for C/C++ TM programs. Using T-Rex we discover transactional data races in STAMP applications that, to the best of our knowledge, have not been previously reported. Our experiments also show that T-Rex's runtime overhead is comparable to that of state-of-the-art lock-based race detection tools, despite the extra work required to handle transactional memory semantics.
{"title":"T-Rex: a dynamic race detection tool for C/C++ transactional memory applications","authors":"Gokcen Kestor, O. Unsal, A. Cristal, S. Tasiran","doi":"10.1145/2592798.2592809","DOIUrl":"https://doi.org/10.1145/2592798.2592809","url":null,"abstract":"Transactional memory (TM) has reached a maturity level and programmers have started using this programming model to parallelize their applications. However, although much effort has been put into the development of TM systems, there is still lack of debugging and development tools for TM applications, such as race detection tools.\u0000 Previous definitions of transactional data race often impose constraints on the TM implementation or the programming language and cannot be widely applied to current STM designs. We propose a new definition of transactional data race that follows the programmer's intuition of racy accesses, is independent of thread interleaving, can accommodate popular STM systems, and allows common programming idioms.\u0000 Based on this definition, we design and implement T-Rex, a precise dynamic race detection tool for C/C++ TM programs. Using T-Rex we discover transactional data races in STAMP applications that, to the best of our knowledge, have not been previously reported. Our experiments also show that T-Rex runtime overhead is comparable to state-of-the-art lock-based race detection tools, despite the extra work required to handle transactional memory semantics.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"28 1","pages":"20:1-20:12"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74859811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically and atomically. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated techniques.
{"title":"StackTrack: an automated transactional approach to concurrent memory reclamation","authors":"Dan Alistarh, P. Eugster, M. Herlihy, A. Matveev, N. Shavit","doi":"10.1145/2592798.2592808","DOIUrl":"https://doi.org/10.1145/2592798.2592808","url":null,"abstract":"Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic fashion. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated, techniques.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"1 1","pages":"25:1-25:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80468093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology; these assumptions are invalid in practice and can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same-size flows. We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols, and WCMP can be readily implemented in current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment, and present experimental and simulation results showing that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.
{"title":"WCMP: weighted cost multipathing for improved fairness in data centers","authors":"Junlan Zhou, Malveeka Tewari, Min Zhu, A. Kabbani, L. Poutievski, Arjun Singh, Amin Vahdat","doi":"10.1145/2592798.2592803","DOIUrl":"https://doi.org/10.1145/2592798.2592803","url":null,"abstract":"Data Center topologies employ multiple paths among servers to deliver scalable, cost-effective network capacity. The simplest and the most widely deployed approach for load balancing among these paths, Equal Cost Multipath (ECMP), hashes flows among the shortest paths toward a destination. ECMP leverages uniform hashing of balanced flow sizes to achieve fairness and good load balancing in data centers. However, we show that ECMP further assumes a balanced, regular, and fault-free topology, which are invalid assumptions in practice that can lead to substantial performance degradation and, worse, variation in flow bandwidths even for same size flows.\u0000 We present a set of simple algorithms that achieve Weighted Cost Multipath (WCMP) to balance traffic in the data center based on the changing network topology. The state required for WCMP is already disseminated as part of standard routing protocols and it can be readily implemented in the current switch silicon without any hardware modifications. We show how to deploy WCMP in a production OpenFlow network environment and present experimental and simulation results to show that variation in flow bandwidths can be reduced by as much as 25X by employing WCMP relative to ECMP.","PeriodicalId":20737,"journal":{"name":"Proceedings of the Eleventh European Conference on Computer Systems","volume":"84 1","pages":"5:1-5:14"},"PeriodicalIF":0.0,"publicationDate":"2014-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83340173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}