A study of data structures with a deep heap shape
Haggai Eran, E. Petrank
DOI: 10.1145/2492408.2492413

Computing environments are becoming increasingly parallel, and it seems likely that tomorrow's desktops and server platforms will offer even more cores. In a highly parallel system, tracing garbage collectors may not scale well because deep heap structures hinder parallel tracing. Previous work has discovered such vulnerabilities in standard Java benchmarks. In this work we examine these benchmarks and analyze them to expose the data structures that give current Java benchmarks their deep heap shapes. It turns out that the problem arises mostly in benchmarks that employ queues and linked lists. We then propose a new construction of a lock-free queue data structure with extra references that enables better garbage-collector parallelism at low overhead.
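
The abstract does not spell out where the extra references live, so the sketch below is only one plausible reading, not the paper's design: a Michael-Scott-style lock-free queue in which every K-th enqueued node is also published into a small array of shortcut references. A tracing collector that treats the shortcuts as additional roots can start scanning the list from many points instead of chasing next pointers from the head alone. The class name, the stride, and the shortcut array size are illustrative choices.

```cpp
// Sketch under stated assumptions (not the paper's exact design): a
// Michael-Scott-style lock-free queue in which every kStride-th enqueued
// node is also published into a small array of "shortcut" references.
// A parallel tracing collector that treats the shortcuts as extra roots can
// start scanning the linked list from many points at once instead of
// walking it node by node from the head.
#include <atomic>
#include <cstddef>

template <typename T>
class ShortcutQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
        Node() = default;
        explicit Node(const T& v) : value(v) {}
    };

    static constexpr std::size_t kShortcuts = 64;   // illustrative sizes
    static constexpr std::size_t kStride    = 128;

    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
    std::atomic<std::size_t> enqueued_{0};
    std::atomic<Node*> shortcuts_[kShortcuts];

public:
    ShortcutQueue() {
        Node* dummy = new Node();
        head_.store(dummy);
        tail_.store(dummy);
        for (auto& s : shortcuts_) s.store(nullptr, std::memory_order_relaxed);
    }

    void enqueue(const T& v) {
        Node* node = new Node(v);
        for (;;) {
            Node* last = tail_.load(std::memory_order_acquire);
            Node* next = last->next.load(std::memory_order_acquire);
            if (next != nullptr) {            // tail is lagging: help it along
                tail_.compare_exchange_weak(last, next);
                continue;
            }
            Node* expected = nullptr;
            if (last->next.compare_exchange_weak(expected, node)) {
                tail_.compare_exchange_strong(last, node);
                break;
            }
        }
        // Publish every kStride-th node as an extra reference for the tracer.
        std::size_t n = enqueued_.fetch_add(1, std::memory_order_relaxed);
        if (n % kStride == 0)
            shortcuts_[(n / kStride) % kShortcuts].store(
                node, std::memory_order_release);
    }
};
```

Only the enqueue path is shown; dequeue would follow the usual Michael-Scott scheme, and a complete design would also have to clear or age shortcut entries as nodes are dequeued so they do not keep dead nodes reachable.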

A new perspective on processing-in-memory architecture design
D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Mitesh R. Meswani, Mark Nutter, Mike Ignatowski
DOI: 10.1145/2492408.2492418

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical for maintaining the performance scaling that many have come to expect from the computing industry. Moving computation closer to main memory presents an opportunity to reduce the overheads associated with data movement. We explore the potential of using 3D die stacking to move memory-intensive computations closer to memory. This approach to processing-in-memory addresses some drawbacks of prior research on in-memory computing and appears commercially viable in the foreseeable future. We show promising early results from this approach and identify areas that are in need of research to unlock its full potential.

Software-controlled transparent management of heterogeneous memory resources in virtualized systems
Min Lee, Vishal Gupta, K. Schwan
DOI: 10.1145/2492408.2492416

This paper presents a software-controlled technique for managing the heterogeneous memory resources of next-generation multicore platforms with fast 3D die-stacked memory and additional slow off-chip memory. Implemented for virtualized server systems, the technique detects the 'hot' pages critical to program performance in order to maintain them in the scarce fast 3D memory resources. Challenges overcome in the technique's implementation include the need to minimize its runtime overheads, the lack of hypervisor-level direct visibility into the memory access behavior of guest virtual machines, and the need to make page migration transparent to guests. This paper presents hypervisor-level mechanisms that (i) build a page access history of virtual machines by periodically scanning page-table access bits and (ii) intercept guest page-table operations to create mirrored page tables and enable guest-transparent page migration. The methods are implemented in the Xen hypervisor and evaluated on a large-scale multicore platform. The resulting ability to characterize the memory behavior of representative server workloads demonstrates the feasibility of software-managed heterogeneous memory resources.
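
The paper's mechanisms live inside the Xen hypervisor, which cannot be reproduced in a short example, but the bookkeeping behind "scan access bits, build a history, pick hot pages" can be sketched in isolation. The sketch below assumes some external component delivers, per scan interval, a bitmap of pages whose access bit was set; the decay factor and class name are illustrative assumptions.

```cpp
// Illustrative bookkeeping (not Xen code): some external component is assumed
// to deliver, once per scan interval, a bitmap of pages whose page-table
// access bit was set (and then cleared). The tracker keeps an exponentially
// decayed per-page history and reports the hottest pages, i.e. the candidates
// to keep in the small fast die-stacked memory.
#include <algorithm>
#include <cstddef>
#include <vector>

class HotPageTracker {
    std::vector<double> history_;   // decayed access score, one entry per page

public:
    explicit HotPageTracker(std::size_t num_pages) : history_(num_pages, 0.0) {}

    // referenced[i] is true if page i was touched during the last interval.
    void record_scan(const std::vector<bool>& referenced, double decay = 0.5) {
        for (std::size_t i = 0; i < history_.size(); ++i)
            history_[i] = history_[i] * decay + (referenced[i] ? 1.0 : 0.0);
    }

    // Indices of the k hottest pages: the best candidates for fast memory.
    std::vector<std::size_t> hottest(std::size_t k) const {
        std::vector<std::size_t> pages(history_.size());
        for (std::size_t i = 0; i < pages.size(); ++i) pages[i] = i;
        k = std::min(k, pages.size());
        std::partial_sort(pages.begin(), pages.begin() + k, pages.end(),
                          [this](std::size_t a, std::size_t b) {
                              return history_[a] > history_[b];
                          });
        pages.resize(k);
        return pages;
    }
};
```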

APE: accelerator processor extensions to optimize data-compute co-location
Ganesh Venkatesh
DOI: 10.1145/2492408.2492412

Two technological trends in current systems are the march towards many-core designs and a greater focus on power efficiency. The increase in core counts results in smaller caches per compute node and greater reliance on exposing task-level parallelism in applications. However, this potentially increases the amount of data that moves within and between the different tasks and, hence, the associated power cost. This places a new burden on already power-constrained systems, and the situation will only get worse because the power consumed by the wires is not scaling down much with each technology generation, while the amount of data these wires move keeps increasing.

This paper addresses the concern by identifying the memory access patterns that account for much of the data movement and designing processor extensions, APEs, to support them. These processor extensions are placed closer to the cache structures, rather than the core pipeline, to reduce data movement and improve compute-data co-location. We show that by doing this we are able to reduce a task's memory accesses by ~2.5×, data movement by 4×, and cache miss rate by 40% for a wide range of applications.

Analyzing locality of memory references in GPU architectures
Saurabh Gupta, Ping Xiang, Huiyang Zhou
DOI: 10.1145/2492408.2492423

In this paper we advocate formal locality analysis of the memory references of GPGPU kernels. We investigate the locality of reference at different cache levels in the memory hierarchy. At the L1 cache level, we look into the locality behavior at the warp, thread-block, and streaming-multiprocessor levels. Using matrix multiplication as a case study, we show that our locality analysis accurately captures some interesting and counter-intuitive behavior of the memory accesses. We believe that such analysis will provide very useful insights into understanding the memory access behavior and optimizing the memory hierarchy in GPU architectures.
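
As a flavor of what warp-level locality analysis can reveal on the matrix-multiplication case study, the small program below (an illustration, not the paper's model) enumerates the addresses one warp of a naive row-major kernel would issue in a single k-iteration and counts how many are distinct.

```cpp
// Illustrative warp-level analysis (not the paper's model): enumerate the
// global-memory addresses one warp of a naive matrix-multiply kernel touches
// in a single k-iteration and count the distinct ones. A count of 1 means the
// warp broadcasts a single element (perfect intra-warp reuse); a count of 32
// means every lane reads its own element (no intra-warp reuse, though the
// accesses may still coalesce if the addresses are consecutive).
#include <cstdio>
#include <set>

int main() {
    const int N = 1024;                   // assumed square, row-major matrices
    const int WARP = 32;                  // lanes differ only in threadIdx.x
    const int row = 7, col0 = 64, k = 5;  // arbitrary warp position

    std::set<long> a_addrs, b_addrs;
    for (int lane = 0; lane < WARP; ++lane) {
        int col = col0 + lane;               // this lane's output column
        a_addrs.insert((long)row * N + k);   // A[row][k]: same for all lanes
        b_addrs.insert((long)k * N + col);   // B[k][col]: one per lane
    }
    std::printf("distinct A elements per warp-iteration: %zu\n", a_addrs.size());
    std::printf("distinct B elements per warp-iteration: %zu\n", b_addrs.size());
    return 0;
}
```

For this kernel the program reports 1 distinct A element (a broadcast) and 32 distinct, consecutive B elements per warp-iteration, matching the intuition that A enjoys perfect intra-warp reuse while B relies on coalescing.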

Introducing kernel-level page reuse for high performance computing
S. Valat, Marc Pérache, W. Jalby
DOI: 10.1145/2492408.2492414

As computer architectures evolve, more and more HPC applications have to include thread-based parallelism and manage their memory consumption. These evolutions demand more attention to the full memory-management chain, which is particularly stressed in multi-threaded contexts. Several memory allocators provide better scalability on the user-space side, but with the steadily increasing number of cores, the impact of the operating system can no longer be neglected. We measured the performance impact of the OS memory subsystem at up to one third of the total execution time of a real application on 128 cores. On modern architectures, we measured that up to 40% of the page-fault time is spent in page zeroing. In this paper, we detail a proposal to improve paging performance by removing the need for this unproductive page zeroing through an extension of the mmap semantics. To this end, we added a kernel-level memory page pool per process to locally reuse free pages without resetting their contents. Our experiments show significant performance improvements, especially for huge pages.
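
The proposal itself is a kernel change plus an mmap extension, so it cannot be reproduced verbatim here; the user-space analogue below only illustrates the underlying idea of reusing already-mapped pages within a process so the kernel never has to zero them again on a fresh fault. The pool interface and size-class keying are assumptions for the sketch.

```cpp
// User-space analogue (not the paper's kernel extension): keep freed mmap'd
// chunks in a per-process pool and hand them back on the next request of the
// same size instead of calling munmap/mmap again. Reusing a live mapping
// inside the process avoids the page-fault-time zeroing of fresh anonymous
// pages, which is the cost the paper attacks at the kernel level.
#include <sys/mman.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

class PagePool {
    std::unordered_map<std::size_t, std::vector<void*>> free_chunks_;

public:
    void* acquire(std::size_t bytes) {
        auto& list = free_chunks_[bytes];
        if (!list.empty()) {              // reuse: no fresh zeroed pages needed
            void* p = list.back();
            list.pop_back();
            return p;
        }
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }

    void release(void* p, std::size_t bytes) {
        free_chunks_[bytes].push_back(p);  // keep the mapping alive for reuse
    }

    ~PagePool() {
        for (auto& [bytes, list] : free_chunks_)
            for (void* p : list) munmap(p, bytes);
    }
};
```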

Software-level scheduling to exploit non-uniformly shared data cache on GPGPU
Bo Wu, Weilin Wang, Xipeng Shen
DOI: 10.1145/2492408.2492421

Data caches have been introduced to GPUs to mitigate the irregular memory access problem, but few studies have investigated how to exploit their full potential. In this work, we consider some important GPU applications that feature data sharing across thread blocks. We show that the sharing is not well exploited because the current GPU runtime ignores this factor when scheduling threads. We then present an application-level transformation to remap thread blocks to data on the fly. With the software-level scheduler, thread blocks that share much data are scheduled to share the cache on a streaming multiprocessor (SM). Experiments on four benchmarks show a 1.23X speedup on average.
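
The abstract leaves the remapping policy open, so the host-side sketch below makes one simple assumption: each logical thread block can be labeled with the data tile it mostly reads, and blocks are regrouped so that sharers receive adjacent positions in an indirection table, which a software-level scheduler can then hand to the same SM. The function and parameter names are illustrative, not the paper's.

```cpp
// Host-side sketch under stated assumptions (not the paper's scheduler):
// group logical thread blocks by the data tile they mostly read, so that
// sharers end up adjacent in the issue order. A kernel driven by a
// software-level scheduler can then assign adjacent positions to the same
// SM, letting blocks that share a tile also share that SM's L1 cache.
#include <algorithm>
#include <vector>

// tile_of_block[b] identifies the data tile that logical block b reads; the
// returned vector lists logical block ids reordered so that blocks with the
// same tile are contiguous (a simple indirection table for the kernel).
std::vector<int> build_remap(int num_blocks,
                             const std::vector<int>& tile_of_block) {
    std::vector<int> remap(num_blocks);
    for (int b = 0; b < num_blocks; ++b) remap[b] = b;
    std::stable_sort(remap.begin(), remap.end(), [&](int a, int b) {
        return tile_of_block[a] < tile_of_block[b];
    });
    return remap;
}
```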

A coldness metric for cache optimization
Raj Parihar, C. Ding, Michael C. Huang
DOI: 10.1145/2492408.2492419

A "hot" concept in program optimization is hotness: program optimization targets hot paths, and register allocation targets hot variables. Cache optimization, however, has to target cold data, which are less frequently used and tend to cause cache misses whenever they are accessed. Hot data, in contrast, being small and frequently used, tend to stay in cache. In this paper, we define a new metric called "coldness" and show how coldness varies across programs and how much colder the data we have to optimize becomes as the cache size on modern machines increases.
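
The abstract does not give the formal definition, so the following is only one way to make the idea concrete: call the C most frequently accessed objects "hot" for a cache holding C objects, and measure coldness as the share of accesses that fall outside that hot set. Under this reading, growing the cache leaves only colder and colder data for the optimizer to target.

```cpp
// Illustrative computation (the paper's formal definition may differ): treat
// the cache_capacity most frequently accessed objects as "hot", and report
// the fraction of accesses that go to the remaining cold data. As the cache
// grows, the data left to optimize is accessed ever more rarely.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

double coldness(std::vector<long> access_counts, std::size_t cache_capacity) {
    std::sort(access_counts.begin(), access_counts.end(), std::greater<>());
    long total = 0, hot = 0;
    for (std::size_t i = 0; i < access_counts.size(); ++i) {
        total += access_counts[i];
        if (i < cache_capacity) hot += access_counts[i];
    }
    return total == 0 ? 0.0 : 1.0 - double(hot) / double(total);
}

int main() {
    std::vector<long> counts = {1000, 500, 200, 50, 10, 5, 1, 1};  // per object
    for (std::size_t c : {2, 4, 6})
        std::printf("cache holds %zu objects -> coldness %.3f\n",
                    c, coldness(counts, c));
    return 0;
}
```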

Cache rationing for multicore
Jacob Brock, C. Ding
DOI: 10.1145/2492408.2492422

As the number of transistors on a chip increases, they are used mainly in two ways on multicore processors: first, to increase the number of cores, and second, to increase the size of cache memory. The two approaches intersect at a basic problem, which is how parallel tasks can best share the cache memory. The degree of sharing determines the available cache resource for each core and hence the memory performance and scalability of the system. In this paper, cache rationing is presented as a cache sharing solution for collaborative caching.

All-window data liveness
Pengcheng Li, C. Ding
DOI: 10.1145/2492408.2492420

This paper proposes a new metric called all-window liveness, which is the average amount of live data in all time windows of a given length. The paper gives a linear-time algorithm to compute the average liveness for all window lengths and discusses potential uses of the new metric.
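
The definition can be evaluated directly for a single window length, which is what the sketch below does; the paper's contribution is a linear-time algorithm covering all lengths at once, while this quadratic baseline is only meant to pin down the metric. An object allocated at time a and freed at time f contributes its size to every length-w window it overlaps.

```cpp
// Direct baseline for one window length (the paper's algorithm is linear-time
// and covers all lengths at once): an object allocated at alloc_time and
// freed at free_time is counted in window [t, t+w) whenever the two intervals
// overlap, and the metric averages the counted sizes over all such windows.
#include <cstdio>
#include <vector>

struct Object { long alloc_time, free_time, size; };

double average_liveness(const std::vector<Object>& objs,
                        long trace_length, long window) {
    double sum = 0.0;
    long windows = trace_length - window + 1;
    for (long t = 0; t < windows; ++t) {
        long live = 0;
        for (const Object& o : objs)
            if (o.alloc_time < t + window && o.free_time > t) live += o.size;
        sum += live;
    }
    return windows > 0 ? sum / windows : 0.0;
}

int main() {
    std::vector<Object> objs = {{0, 4, 100}, {2, 9, 50}, {6, 8, 200}};
    std::printf("avg liveness (w=3): %.1f bytes\n",
                average_liveness(objs, 10, 3));
    return 0;
}
```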