Memory models are used in concurrent systems to specify visibility properties of shared data. A practical memory model, however, must permit code optimization as well as provide a useful semantics for programmers. Here we extend recent observations that the current Java memory model imposes significant restrictions on the ability to optimize code. Beyond the known and potentially correctable proof concerns illustrated by others we show that major constraints on code generation and optimization can in fact be derived from fundamental properties and guarantees provided by the memory model. To address this and accommodate a better balance between programmability and optimization we present ideas for a simple concurrency semantics for Java that avoids basic problems at a cost of backward compatibility.
{"title":"There is nothing wrong with out-of-thin-air: compiler optimization and memory models","authors":"Clark Verbrugge, Allan Kielstra, Yi Zhang","doi":"10.1145/1988915.1988917","DOIUrl":"https://doi.org/10.1145/1988915.1988917","url":null,"abstract":"Memory models are used in concurrent systems to specify visibility properties of shared data. A practical memory model, however, must permit code optimization as well as provide a useful semantics for programmers. Here we extend recent observations that the current Java memory model imposes significant restrictions on the ability to optimize code. Beyond the known and potentially correctable proof concerns illustrated by others we show that major constraints on code generation and optimization can in fact be derived from fundamental properties and guarantees provided by the memory model. To address this and accommodate a better balance between programmability and optimization we present ideas for a simple concurrency semantics for Java that avoids basic problems at a cost of backward compatibility.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123773307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Laura Effinger-Dean, H. Boehm, Dhruva R. Chakrabarti, P. Joisha
Most multithreaded programming languages prohibit or discourage data races. By avoiding data races, we are guaranteed that variables accessed within a synchronization-free code region cannot be modified by other threads, allowing us to reason about such code regions as though they were single-threaded. However, such single-threaded reasoning is not limited to synchronization-free regions. We present a simple characterization of extended interference-free regions in which variables cannot be modified by other threads. This characterization shows that, in the absence of data races, important code analysis problems often have surprisingly easy answers. For instance, we can use local analysis to determine when lock and unlock calls refer to the same mutex. Our characterization can be derived from prior work on safe compiler transformations, but it can also be simply derived from first principles, and justified in a very broad context. In addition, systematic reasoning about overlapping interference-free regions yields insight about optimization opportunities that were not previously apparent. We give preliminary results for a prototype implementation of interference-free regions in the LLVM compiler infrastructure. We also discuss other potential applications for interference-free regions.
{"title":"Extended sequential reasoning for data-race-free programs","authors":"Laura Effinger-Dean, H. Boehm, Dhruva R. Chakrabarti, P. Joisha","doi":"10.1145/1988915.1988922","DOIUrl":"https://doi.org/10.1145/1988915.1988922","url":null,"abstract":"Most multithreaded programming languages prohibit or discourage data races. By avoiding data races, we are guaranteed that variables accessed within a synchronization-free code region cannot be modified by other threads, allowing us to reason about such code regions as though they were single-threaded. However, such single-threaded reasoning is not limited to synchronization-free regions. We present a simple characterization of extended interference-free regions in which variables cannot be modified by other threads.\u0000 This characterization shows that, in the absence of data races, important code analysis problems often have surprisingly easy answers. For instance, we can use local analysis to determine when lock and unlock calls refer to the same mutex. Our characterization can be derived from prior work on safe compiler transformations, but it can also be simply derived from first principles, and justified in a very broad context. In addition, systematic reasoning about overlapping interference-free regions yields insight about optimization opportunities that were not previously apparent.\u0000 We give preliminary results for a prototype implementation of interference-free regions in the LLVM compiler infrastructure. We also discuss other potential applications for interference-free regions.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131207478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proposals to treat data races as exceptions provide simplified semantics for shared-memory multithreaded programming languages and memory models by guaranteeing that execution remains data-race-free and sequentially consistent or an exception is raised. However, the high cost of precise race detection has kept the cost-to-benefit ratio of data-race exceptions too high for widespread adoption. Most research to improve this ratio focuses on lowering performance cost. In this position paper, we argue that with small changes in how we view data races, data-race exceptions enable a broad class of benefits beyond the memory model, including performance and simplicity in applications at the runtime system level. When attempted (but exception-raising) racy accesses are treated as legal --- but exceptional --- behavior, applications can exploit the guarantees of the data-race exception mechanism by performing potentially racy accesses and guiding execution based on whether these potential races manifest as exceptions. We apply these insights to concurrent garbage collection, optimistic synchronization elision, and best-effort automatic recovery from exceptions due to sequential-consistency-violating races.
{"title":"Data-race exceptions have benefits beyond the memory model","authors":"Benjamin P. Wood, L. Ceze, D. Grossman","doi":"10.1145/1988915.1988923","DOIUrl":"https://doi.org/10.1145/1988915.1988923","url":null,"abstract":"Proposals to treat data races as exceptions provide simplified semantics for shared-memory multithreaded programming languages and memory models by guaranteeing that execution remains data-race-free and sequentially consistent or an exception is raised. However, the high cost of precise race detection has kept the cost-to-benefit ratio of data-race exceptions too high for widespread adoption. Most research to improve this ratio focuses on lowering performance cost.\u0000 In this position paper, we argue that with small changes in how we view data races, data-race exceptions enable a broad class of benefits beyond the memory model, including performance and simplicity in applications at the runtime system level. When attempted (but exception-raising) racy accesses are treated as legal --- but exceptional --- behavior, applications can exploit the guarantees of the data-race exception mechanism by performing potentially racy accesses and guiding execution based on whether these potential races manifest as exceptions. We apply these insights to concurrent garbage collection, optimistic synchronization elision, and best-effort automatic recovery from exceptions due to sequential-consistency-violating races.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114181913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most mainstream shared-memory parallel programming languages are converging to a memory model, or shared variable semantics, centered on providing sequential consistency for most data-race-free programs. OpenMP, along with a small number of other languages, defines its memory model in terms of implicit fence (e.g. OpenMP flush) operations that force memory accesses to become visible to other threads in order. Synchronization operations provided by the language implicitly include such fences. In the simplest cases this is equivalent to a promise of sequential consistency for data-race-free programs. However, real languages typically also provide atomic operations with weak memory ordering constraints, such as the OpenMP atomic directives. These break the above equivalence, making the fence-based model stronger in ways that are observable, but not generally useful. As a result, conventional lock implementations are often accidentally prohibited, adding significant overhead for uncontended locks. We show that this problem affects both OpenMP and, in a more subtle way, UPC. We have been working with the OpenMP ARB to resolve these issues in future versions of OpenMP.
{"title":"Performance implications of fence-based memory models","authors":"H. Boehm","doi":"10.1145/1988915.1988919","DOIUrl":"https://doi.org/10.1145/1988915.1988919","url":null,"abstract":"Most mainstream shared-memory parallel programming languages are converging to a memory model, or shared variable semantics, centered on providing sequential consistency for most data-race-free programs.\u0000 OpenMP, along with a small number of other languages, defines its memory model in terms of implicit fence (e.g. OpenMP flush) operations that force memory accesses to become visible to other threads in order. Synchronization operations provided by the language implicitly include such fences. In the simplest cases this is equivalent to a promise of sequential consistency for data-race-free programs.\u0000 However, real languages typically also provide atomic operations with weak memory ordering constraints, such as the OpenMP atomic directives. These break the above equivalence, making the fence-based model stronger in ways that are observable, but not generally useful. As a result, conventional lock implementations are often accidentally prohibited, adding significant overhead for uncontended locks.\u0000 We show that this problem affects both OpenMP and, in a more subtle way, UPC. We have been working with the OpenMP ARB to resolve these issues in future versions of OpenMP.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126903969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative caching uses different caching methods, e. g., LRU and MRU, for data with good or poor locality. Poorlocality data are evicted by MRU quickly, leaving most cache space to hold good-locality data by LRU. In our previous study, we selected static memory references with poor locality to use MRU but neglected minor references, which are memory instructions that contribute no more than 0.1% total memory accesses. After removing this restriction, we found that three SPEC CPU benchmarks have on average 6.2 times fewer miss reduction or 9.8% reduction in absolute miss ratio.
{"title":"Minor memory references matter in collaborative caching","authors":"Xiaoming Gu","doi":"10.1145/1988915.1988927","DOIUrl":"https://doi.org/10.1145/1988915.1988927","url":null,"abstract":"Collaborative caching uses different caching methods, e. g., LRU and MRU, for data with good or poor locality. Poorlocality data are evicted by MRU quickly, leaving most cache space to hold good-locality data by LRU. In our previous study, we selected static memory references with poor locality to use MRU but neglected minor references, which are memory instructions that contribute no more than 0.1% total memory accesses. After removing this restriction, we found that three SPEC CPU benchmarks have on average 6.2 times fewer miss reduction or 9.8% reduction in absolute miss ratio.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114519788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sven Auhagen, Lars Bergstrom, M. Fluet, John H. Reppy
Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. This bandwidth limitation has traditionally limited the scalability of modern functional language implementations, which seldom scale well past 8 cores, even on small benchmarks. This work presents a garbage collector integrated with our strict, parallel functional language implementation, Manticore, and shows that it scales effectively on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.
{"title":"Garbage collection for multicore NUMA machines","authors":"Sven Auhagen, Lars Bergstrom, M. Fluet, John H. Reppy","doi":"10.1145/1988915.1988929","DOIUrl":"https://doi.org/10.1145/1988915.1988929","url":null,"abstract":"Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. This bandwidth limitation has traditionally limited the scalability of modern functional language implementations, which seldom scale well past 8 cores, even on small benchmarks.\u0000 This work presents a garbage collector integrated with our strict, parallel functional language implementation, Manticore, and shows that it scales effectively on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"2003 16","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113966530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The shared memory research community has proposed many complex communication protocols that aim to eliminate specific performance bottlenecks, while still providing an easy-to-use communication interface. Although tailored protocols can eliminate some bottlenecks that arise in real applications, removing the cause of the bottleneck through software optimizations and bug fixes is cheaper to implement, faster to fix (once found), and requires no additional support by the hardware beyond a simple shared memory interface. In fact, in our experience, the choice of coherence protocol is much less important than providing an efficient hardware feedback that indentifies the source of the problem. Future cache-coherence research should focus efforts on illuminating memory system behavior, providing smarter tools to identify bottlenecks, and helping to eliminate them through software optimizations.
{"title":"The case for simple, visible cache coherency","authors":"R. Kunz, M. Horowitz","doi":"10.1145/1353522.1353532","DOIUrl":"https://doi.org/10.1145/1353522.1353532","url":null,"abstract":"The shared memory research community has proposed many complex communication protocols that aim to eliminate specific performance bottlenecks, while still providing an easy-to-use communication interface. Although tailored protocols can eliminate some bottlenecks that arise in real applications, removing the cause of the bottleneck through software optimizations and bug fixes is cheaper to implement, faster to fix (once found), and requires no additional support by the hardware beyond a simple shared memory interface. In fact, in our experience, the choice of coherence protocol is much less important than providing an efficient hardware feedback that indentifies the source of the problem. Future cache-coherence research should focus efforts on illuminating memory system behavior, providing smarter tools to identify bottlenecks, and helping to eliminate them through software optimizations.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133461034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces GC assertions, a system interface that programmers can use to check for errors, such as data structure invariant violations, and to diagnose performance problems, such as memory leaks. GC assertions are checked by the garbage collector, which is in a unique position to gather information and answer questions about the lifetime and connectivity of objects in the heap. We introduce several kinds of GC assertions, and we describe how they are implemented in the collector. We also describe our reporting mechanism, which provides a complete path through the heap to the offending objects. We show results for one type of assertion that allows the programmer to indicate that an object should be reclaimed at the next GC. We find that using this assertion we can quickly identify a memory leak and its cause with negligible overhead.
{"title":"GC assertions: using the garbage collector to check heap properties","authors":"E. Aftandilian, Samuel Z. Guyer","doi":"10.1145/1353522.1353533","DOIUrl":"https://doi.org/10.1145/1353522.1353533","url":null,"abstract":"This paper introduces GC assertions, a system interface that programmers can use to check for errors, such as data structure invariant violations, and to diagnose performance problems, such as memory leaks. GC assertions are checked by the garbage collector, which is in a unique position to gather information and answer questions about the lifetime and connectivity of objects in the heap. We introduce several kinds of GC assertions, and we describe how they are implemented in the collector. We also describe our reporting mechanism, which provides a complete path through the heap to the offending objects. We show results for one type of assertion that allows the programmer to indicate that an object should be reclaimed at the next GC. We find that using this assertion we can quickly identify a memory leak and its cause with negligible overhead.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115968109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes a formalization of the ARM weakly consistent memory model: the architectural contract between parallel programs and shared memory multiprocessor implementations. We claim that a clean, unambiguous, and mechanically verifiable specification is a valuable resource for architects, micro-architects and programmers; it allows implementors to forge aggressive static (compiler) and dynamic (JIT, micro-architecture) machines to run code. We discuss the key construct of the ARM memory model, observability -- the order in which memory accesses become visible to processors in a shared memory multiprocessor system -- and examine its use in litmus tests.
{"title":"Reasoning about the ARM weakly consistent memory model","authors":"Nathan Chong, Samin S. Ishtiaq","doi":"10.1145/1353522.1353528","DOIUrl":"https://doi.org/10.1145/1353522.1353528","url":null,"abstract":"This paper describes a formalization of the ARM weakly consistent memory model: the architectural contract between parallel programs and shared memory multiprocessor implementations. We claim that a clean, unambiguous, and mechanically verifiable specification is a valuable resource for architects, micro-architects and programmers; it allows implementors to forge aggressive static (compiler) and dynamic (JIT, micro-architecture) machines to run code. We discuss the key construct of the ARM memory model, observability -- the order in which memory accesses become visible to processors in a shared memory multiprocessor system -- and examine its use in litmus tests.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121606954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nowadays, all major processors provide a set of performance counters which capture micro-architectural level information, such as the number of elapsed cycles, cache misses, or instructions executed. Counters can be found in processor cores, processor die, chipsets, or in I/O cards. They can provide a wealth of information as to how the hardware is being used by software. Many processors now support events to measure precisely and with very limited overhead, the traffic between a core and the memory subsystem. It is possible to compute average load latency and bus band-width utilization. This valuable information can be used to improve code quality and placement of threads to maximize hardware utilization. We postulate that performance counters are the key hardware resource to locate and understand issues related to the memory subsystem. In this paper we illustrate our position by showing how certain key memory performance metrics can be gathered easily on today's hardware.
{"title":"What can performance counters do for memory subsystem analysis?","authors":"S. Eranian","doi":"10.1145/1353522.1353531","DOIUrl":"https://doi.org/10.1145/1353522.1353531","url":null,"abstract":"Nowadays, all major processors provide a set of performance counters which capture micro-architectural level information, such as the number of elapsed cycles, cache misses, or instructions executed. Counters can be found in processor cores, processor die, chipsets, or in I/O cards. They can provide a wealth of information as to how the hardware is being used by software. Many processors now support events to measure precisely and with very limited overhead, the traffic between a core and the memory subsystem. It is possible to compute average load latency and bus band-width utilization. This valuable information can be used to improve code quality and placement of threads to maximize hardware utilization.\u0000 We postulate that performance counters are the key hardware resource to locate and understand issues related to the memory subsystem. In this paper we illustrate our position by showing how certain key memory performance metrics can be gathered easily on today's hardware.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127775042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}