Supporting virtual memory in GPGPU without supporting precise exceptions
Hyesoon Kim. Workshop on Memory System Performance and Correctness, 2012. DOI: 10.1145/2247684.2247698

Supporting precise exceptions has been one of the essential components of modern out-of-order processor design. It enables exception handling routines, including virtual memory support, and also supports debugging features. However, the GPGPU, one of today's popular scientific computing platforms, does not support precise exceptions. In this paper, we argue that supporting precise exceptions is not essential for GPGPUs, and we propose an alternative solution that provides virtual memory support without them.
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Meng-Ju Wu, D. Yeung. Workshop on Memory System Performance and Correctness, 2012. DOI: 10.1145/2247684.2247687

Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size.
Defensive loop tiling for multi-core processor
Bin Bao, Xiaoya Xiang. Workshop on Memory System Performance and Correctness, 2012. DOI: 10.1145/2247684.2247701

Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last-level cache (LLC), is shared. In this paper, we show that cache sharing requires different types of tiling depending on the co-run programs. We analyze the reasons for the performance differences and present a defensive strategy that consistently performs best or near-best. For example, compared with conservative tiling, which tiles for the private cache, defensive tiling performs similarly in solo runs but up to 20% better in program co-runs when tested on an Intel multicore processor.
A study towards optimal data layout for GPU computing
E. Zhang, Han Li, Xipeng Shen. Workshop on Memory System Performance and Correctness, 2012. DOI: 10.1145/2247684.2247699

The performance of graphics processing units (GPUs) is sensitive to irregular memory references. A recent study shows the promise of eliminating irregular references through runtime thread-data remapping. However, how to efficiently determine the optimal mapping remains an open question. This paper presents an initial exploration of the question, especially in the dimension of data layout optimization. It describes three algorithms to compute or approximate optimal data layouts for GPUs. These algorithms exhibit a spectrum of tradeoffs among space cost, time cost, and the quality of the resulting data layouts.
Let there be light!: the future of memory systems is photonics and 3D stacking
K. Bergman, G. Hendry, Paul H. Hargrove, J. Shalf, B. Jacob, K. Hemmert, Arun Rodrigues, D. Resnick. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988926

Energy consumption is the fundamental barrier to exascale supercomputing, and it is dominated by the cost of moving data from one point to another, not computation. Similarly, performance is dominated by data movement, not computation. The solution to this problem requires three critical technologies: 3D integration, optical chip-to-chip communication, and a new communication model. A memory system based on these technologies has the potential to lower the cost of local memory accesses by orders of magnitude and provide substantially more bandwidth. To reach the goals of exascale computing with a manageable power budget, the industry will have to adopt these technologies. Doing so will enable exascale computing, and will have a major worldwide economic impact.
Deferred gratification: engineering for high performance garbage collection from the get go
Ivan Jibaja, S. Blackburn, M. Haghighat, K. McKinley. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988930

Implementing a new programming language system is a daunting task. A common trap is to punt on the design and engineering of exact garbage collection and instead opt for reference counting or conservative garbage collection (GC). For example, the AppleScript™, Perl, Python, and PHP implementers chose reference counting (RC), and Ruby chose conservative GC. Although easier to get working, reference counting has terrible performance, and conservative GC is inflexible and performs poorly when allocation rates are high. However, high performance GC is central to performance for managed languages, and it is only becoming more critical due to the relatively lower memory bandwidth and higher memory latency of modern architectures. Unfortunately, retrofitting support for high performance collectors later is a formidable software engineering task due to their exact nature. Whether they realize it or not, implementers have three routes: (1) forge ahead with reference counting or conservative GC, and worry about the consequences later; (2) build the language on top of an existing managed runtime with exact GC, and tune the GC to scripting language workloads; or (3) engineer exact GC from the ground up and enjoy the correctness and performance benefits sooner rather than later.

We explore this conundrum using PHP, the most popular server-side scripting language. PHP implements reference counting, mirroring scripting languages before it. Because reference counting is incomplete, implementers must (a) also implement tracing to detect cyclic garbage, (b) prohibit cyclic data structures, or (c) never reclaim cyclic garbage. PHP chose (a), AppleScript chose (b), and Perl chose (c). We characterize the memory behavior of five typical PHP programs to determine whether this implementation choice was a good one in light of the growing demand for high performance PHP. The memory behavior of these PHP programs is similar to that of other managed languages, such as Java™: they allocate many short-lived objects with a large variety of object sizes, and the average allocated object size is small. These characteristics suggest copying generational GC will attain high performance.

Language implementers who are serious about correctness and performance need to understand deferred gratification: paying the software engineering cost of exact GC up front will deliver correctness and memory system performance later.
How to fit program footprint curves
Xiaoya Xiang, Bin Bao. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988920

A footprint is the volume of data accessed in a time window. A complete characterization requires summarizing all footprints in all execution windows. A concise summary is the footprint curve, which gives the average footprint in windows of different lengths. The footprint curve contains information from all footprints. It can be measured in time O(n) for a trace of length n, which is fast enough for most benchmarks.

In this paper, we outline a study on footprint curves. We propose four curve fitting methods based on the real data observed in SPEC benchmark programs.
A programming model for deterministic task parallelism
Polyvios Pratikakis, H. Vandierendonck, Spyros Lyberis, Dimitrios S. Nikolopoulos. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988918

The currently dominant programming models for multicore processors use threads that run over shared memory. However, as core counts increase, cache coherence protocols become very complex and inefficient, and maintaining a shared memory abstraction becomes expensive and impractical. Moreover, writing multithreaded programs is notoriously difficult, as the programmer needs to reason about all possible thread interleavings and interactions, including the myriad implicit, non-obvious, and often unpredictable thread interactions through shared memory. Overall, as processors gain more cores and parallel software becomes mainstream, the shared memory model reaches its limits in both ease of programming and efficiency.

This position paper presents two ideas aimed at solving the problem. First, we restrict the way the programmer expresses parallelism: the program is a collection of possibly recursive tasks, where each task is atomic and cannot communicate with any other task during its execution. Second, we relax the requirement for coherent shared memory: each task defines its memory footprint, and is guaranteed to have exclusive access to that memory during its execution. Using this model, we can then define a runtime system that transparently performs the data transfers required among cores without cache coherence, and also produces a deterministic execution of the program, provably equivalent to its sequential elision.
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain
G. Russell, C. Riley, Neil Henning, Uwe Dolinsky, A. Richards, A. Donaldson, A. V. Amesfoort. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988925

Memory architectures must adapt for multicore software to achieve performance and scalability. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working in the area of compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding the impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game.
Approximating inclusion-based points-to analysis
R. Nasre. Workshop on Memory System Performance and Correctness, 2011. DOI: 10.1145/1988915.1988931

It is well established that achieving a points-to analysis that is scalable in terms of analysis time typically involves trading off analysis precision and/or memory. In this paper, we propose a novel technique to approximate the solution of an inclusion-based points-to analysis. The technique is based on intelligently approximating pointer- and location-equivalence across variables in the program. We develop a simple approximation algorithm based on this technique. By exploiting various behavioral properties of the solution, we then develop an improved algorithm that implements optimizations related to merging order, proximity search, lazy merging, and identification frequency. The improved algorithm gives the client strong control to trade off analysis time and precision as its requirements dictate. Using a large suite of programs, including the SPEC 2000 benchmarks and five large open source programs, we show how our algorithm helps achieve a scalable solution.