A study of data structures with a deep heap shape
Haggai Eran, E. Petrank
DOI: 10.1145/2492408.2492413

Computing environments are becoming increasingly parallel, and it seems likely that tomorrow's desktops and server platforms will offer even more cores. In a highly parallel system, tracing garbage collectors may not scale well because deep heap structures hinder parallel tracing. Previous work has discovered such vulnerabilities in standard Java benchmarks. In this work we examine these benchmarks and analyze them to expose the data structures that give current Java benchmarks their deep heap shapes. It turns out that the problem arises mostly in benchmarks that employ queues and linked lists. We then propose a new construction of a lock-free queue data structure with extra references that enables better garbage-collector parallelism at low overhead.
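
The abstract does not spell out where the extra references live, so the sketch below is only one plausible reading, not the paper's design: a Michael-Scott-style lock-free queue in which every K-th enqueued node is also published into a small array of shortcut references. A tracing collector that treats the shortcuts as additional roots can start scanning the list from many points instead of chasing next pointers from the head alone. The class name, the stride, and the shortcut array size are illustrative choices.

```cpp
// Sketch under stated assumptions (not the paper's exact design): a
// Michael-Scott-style lock-free queue in which every kStride-th enqueued
// node is also published into a small array of "shortcut" references.
// A parallel tracing collector that treats the shortcuts as extra roots can
// start scanning the linked list from many points at once instead of
// walking it node by node from the head.
#include <atomic>
#include <cstddef>

template <typename T>
class ShortcutQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};
        Node() = default;
        explicit Node(const T& v) : value(v) {}
    };

    static constexpr std::size_t kShortcuts = 64;   // illustrative sizes
    static constexpr std::size_t kStride    = 128;

    std::atomic<Node*> head_;
    std::atomic<Node*> tail_;
    std::atomic<std::size_t> enqueued_{0};
    std::atomic<Node*> shortcuts_[kShortcuts];

public:
    ShortcutQueue() {
        Node* dummy = new Node();
        head_.store(dummy);
        tail_.store(dummy);
        for (auto& s : shortcuts_) s.store(nullptr, std::memory_order_relaxed);
    }

    void enqueue(const T& v) {
        Node* node = new Node(v);
        for (;;) {
            Node* last = tail_.load(std::memory_order_acquire);
            Node* next = last->next.load(std::memory_order_acquire);
            if (next != nullptr) {            // tail is lagging: help it along
                tail_.compare_exchange_weak(last, next);
                continue;
            }
            Node* expected = nullptr;
            if (last->next.compare_exchange_weak(expected, node)) {
                tail_.compare_exchange_strong(last, node);
                break;
            }
        }
        // Publish every kStride-th node as an extra reference for the tracer.
        std::size_t n = enqueued_.fetch_add(1, std::memory_order_relaxed);
        if (n % kStride == 0)
            shortcuts_[(n / kStride) % kShortcuts].store(
                node, std::memory_order_release);
    }
};
```

Only the enqueue path is shown; dequeue would follow the usual Michael-Scott scheme, and a complete design would also have to clear or age shortcut entries as nodes are dequeued so they do not keep dead nodes reachable.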

A new perspective on processing-in-memory architecture design
D. Zhang, N. Jayasena, Alexander Lyashevsky, J. Greathouse, Mitesh R. Meswani, Mark Nutter, Mike Ignatowski
DOI: 10.1145/2492408.2492418

As computation becomes increasingly limited by data movement and energy consumption, exploiting locality throughout the memory hierarchy becomes critical for maintaining the performance scaling that many have come to expect from the computing industry. Moving computation closer to main memory presents an opportunity to reduce the overheads associated with data movement. We explore the potential of using 3D die stacking to move memory-intensive computations closer to memory. This approach to processing-in-memory addresses some drawbacks of prior research on in-memory computing and appears commercially viable in the foreseeable future. We show promising early results from this approach and identify areas that are in need of research to unlock its full potential.

Software-controlled transparent management of heterogeneous memory resources in virtualized systems
Min Lee, Vishal Gupta, K. Schwan
DOI: 10.1145/2492408.2492416

This paper presents a software-controlled technique for managing the heterogeneous memory resources of next-generation multicore platforms with fast 3D die-stacked memory and additional slow off-chip memory. Implemented for virtualized server systems, the technique detects the 'hot' pages critical to program performance in order to maintain them in the scarce fast 3D memory resources. Challenges overcome in the technique's implementation include the need to minimize its runtime overheads, the lack of hypervisor-level direct visibility into the memory access behavior of guest virtual machines, and the need to make page migration transparent to guests. This paper presents hypervisor-level mechanisms that (i) build a page access history of virtual machines by periodically scanning page-table access bits and (ii) intercept guest page-table operations to create mirrored page tables and enable guest-transparent page migration. The methods are implemented in the Xen hypervisor and evaluated on a large-scale multicore platform. The resulting ability to characterize the memory behavior of representative server workloads demonstrates the feasibility of software-managed heterogeneous memory resources.
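
The paper's mechanisms live inside the Xen hypervisor, which cannot be reproduced in a short example, but the bookkeeping behind "scan access bits, build a history, pick hot pages" can be sketched in isolation. The sketch below assumes some external component delivers, per scan interval, a bitmap of pages whose access bit was set; the decay factor and class name are illustrative assumptions.

```cpp
// Illustrative bookkeeping (not Xen code): some external component is assumed
// to deliver, once per scan interval, a bitmap of pages whose page-table
// access bit was set (and then cleared). The tracker keeps an exponentially
// decayed per-page history and reports the hottest pages, i.e. the candidates
// to keep in the small fast die-stacked memory.
#include <algorithm>
#include <cstddef>
#include <vector>

class HotPageTracker {
    std::vector<double> history_;   // decayed access score, one entry per page

public:
    explicit HotPageTracker(std::size_t num_pages) : history_(num_pages, 0.0) {}

    // referenced[i] is true if page i was touched during the last interval.
    void record_scan(const std::vector<bool>& referenced, double decay = 0.5) {
        for (std::size_t i = 0; i < history_.size(); ++i)
            history_[i] = history_[i] * decay + (referenced[i] ? 1.0 : 0.0);
    }

    // Indices of the k hottest pages: the best candidates for fast memory.
    std::vector<std::size_t> hottest(std::size_t k) const {
        std::vector<std::size_t> pages(history_.size());
        for (std::size_t i = 0; i < pages.size(); ++i) pages[i] = i;
        k = std::min(k, pages.size());
        std::partial_sort(pages.begin(), pages.begin() + k, pages.end(),
                          [this](std::size_t a, std::size_t b) {
                              return history_[a] > history_[b];
                          });
        pages.resize(k);
        return pages;
    }
};
```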

APE: accelerator processor extensions to optimize data-compute co-location
Ganesh Venkatesh
DOI: 10.1145/2492408.2492412

Two technological trends in current systems are the march towards many-core designs and a greater focus on power efficiency. The increase in core counts results in smaller caches per compute node and greater reliance on exposing task-level parallelism in applications. However, this potentially increases the amount of data that moves within and between the different tasks and, hence, the associated power cost. This places a new burden on already power-constrained systems, and the situation will only get worse because the power consumed by the wires is not scaling down much with each technology generation, while the amount of data these wires move keeps increasing.

This paper addresses the concern by identifying the memory access patterns that account for much of the data movement and designing processor extensions, APEs, to support them. These processor extensions are placed closer to the cache structures, rather than the core pipeline, to reduce data movement and improve compute-data co-location. We show that by doing this we are able to reduce a task's memory accesses by ~2.5×, data movement by 4×, and cache miss rate by 40% for a wide range of applications.

Analyzing locality of memory references in GPU architectures
Saurabh Gupta, Ping Xiang, Huiyang Zhou
DOI: 10.1145/2492408.2492423

In this paper we advocate formal locality analysis of the memory references of GPGPU kernels. We investigate the locality of reference at different cache levels in the memory hierarchy. At the L1 cache level, we look into the locality behavior at the warp, thread-block, and streaming-multiprocessor levels. Using matrix multiplication as a case study, we show that our locality analysis accurately captures some interesting and counter-intuitive behavior of the memory accesses. We believe that such analysis will provide very useful insights into understanding the memory access behavior and optimizing the memory hierarchy in GPU architectures.
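
As a flavor of what warp-level locality analysis can reveal on the matrix-multiplication case study, the small program below (an illustration, not the paper's model) enumerates the addresses one warp of a naive row-major kernel would issue in a single k-iteration and counts how many are distinct.

```cpp
// Illustrative warp-level analysis (not the paper's model): enumerate the
// global-memory addresses one warp of a naive matrix-multiply kernel touches
// in a single k-iteration and count the distinct ones. A count of 1 means the
// warp broadcasts a single element (perfect intra-warp reuse); a count of 32
// means every lane reads its own element (no intra-warp reuse, though the
// accesses may still coalesce if the addresses are consecutive).
#include <cstdio>
#include <set>

int main() {
    const int N = 1024;                   // assumed square, row-major matrices
    const int WARP = 32;                  // lanes differ only in threadIdx.x
    const int row = 7, col0 = 64, k = 5;  // arbitrary warp position

    std::set<long> a_addrs, b_addrs;
    for (int lane = 0; lane < WARP; ++lane) {
        int col = col0 + lane;               // this lane's output column
        a_addrs.insert((long)row * N + k);   // A[row][k]: same for all lanes
        b_addrs.insert((long)k * N + col);   // B[k][col]: one per lane
    }
    std::printf("distinct A elements per warp-iteration: %zu\n", a_addrs.size());
    std::printf("distinct B elements per warp-iteration: %zu\n", b_addrs.size());
    return 0;
}
```

For this kernel the program reports 1 distinct A element (a broadcast) and 32 distinct, consecutive B elements per warp-iteration, matching the intuition that A enjoys perfect intra-warp reuse while B relies on coalescing.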

Introducing kernel-level page reuse for high performance computing
S. Valat, Marc Pérache, W. Jalby
DOI: 10.1145/2492408.2492414

As computer architectures evolve, more and more HPC applications have to include thread-based parallelism and manage their memory consumption. These evolutions demand more attention to the full memory-management chain, which is particularly stressed in multi-threaded contexts. Several memory allocators provide better scalability on the user-space side, but with the steadily increasing number of cores, the impact of the operating system can no longer be neglected. We measured the performance impact of the OS memory subsystem at up to one third of the total execution time of a real application on 128 cores. On modern architectures, we measured that up to 40% of the page-fault time is spent in page zeroing. In this paper, we detail a proposal to improve paging performance by removing the need for this unproductive page zeroing through an extension of the mmap semantics. To this end, we added a kernel-level memory page pool per process to locally reuse free pages without resetting their contents. Our experiments show significant performance improvements, especially for huge pages.
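
The proposal itself is a kernel change plus an mmap extension, so it cannot be reproduced verbatim here; the user-space analogue below only illustrates the underlying idea of reusing already-mapped pages within a process so the kernel never has to zero them again on a fresh fault. The pool interface and size-class keying are assumptions for the sketch.

```cpp
// User-space analogue (not the paper's kernel extension): keep freed mmap'd
// chunks in a per-process pool and hand them back on the next request of the
// same size instead of calling munmap/mmap again. Reusing a live mapping
// inside the process avoids the page-fault-time zeroing of fresh anonymous
// pages, which is the cost the paper attacks at the kernel level.
#include <sys/mman.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

class PagePool {
    std::unordered_map<std::size_t, std::vector<void*>> free_chunks_;

public:
    void* acquire(std::size_t bytes) {
        auto& list = free_chunks_[bytes];
        if (!list.empty()) {              // reuse: no fresh zeroed pages needed
            void* p = list.back();
            list.pop_back();
            return p;
        }
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? nullptr : p;
    }

    void release(void* p, std::size_t bytes) {
        free_chunks_[bytes].push_back(p);  // keep the mapping alive for reuse
    }

    ~PagePool() {
        for (auto& [bytes, list] : free_chunks_)
            for (void* p : list) munmap(p, bytes);
    }
};
```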

Software-level scheduling to exploit non-uniformly shared data cache on GPGPU
Bo Wu, Weilin Wang, Xipeng Shen
DOI: 10.1145/2492408.2492421

Data caches have been introduced to GPUs to mitigate the irregular memory access problem, but few studies have investigated how to exploit their full potential. In this work, we consider some important GPU applications that feature data sharing across thread blocks. We show that the sharing is not well exploited because the current GPU runtime ignores this factor when scheduling threads. We then present an application-level transformation to remap thread blocks to data on the fly. With the software-level scheduler, thread blocks that share much data are scheduled to share the cache on a streaming multiprocessor (SM). Experiments on four benchmarks show a 1.23X speedup on average.
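
The abstract leaves the remapping policy open, so the host-side sketch below makes one simple assumption: each logical thread block can be labeled with the data tile it mostly reads, and blocks are regrouped so that sharers receive adjacent positions in an indirection table, which a software-level scheduler can then hand to the same SM. The function and parameter names are illustrative, not the paper's.

```cpp
// Host-side sketch under stated assumptions (not the paper's scheduler):
// group logical thread blocks by the data tile they mostly read, so that
// sharers end up adjacent in the issue order. A kernel driven by a
// software-level scheduler can then assign adjacent positions to the same
// SM, letting blocks that share a tile also share that SM's L1 cache.
#include <algorithm>
#include <vector>

// tile_of_block[b] identifies the data tile that logical block b reads; the
// returned vector lists logical block ids reordered so that blocks with the
// same tile are contiguous (a simple indirection table for the kernel).
std::vector<int> build_remap(int num_blocks,
                             const std::vector<int>& tile_of_block) {
    std::vector<int> remap(num_blocks);
    for (int b = 0; b < num_blocks; ++b) remap[b] = b;
    std::stable_sort(remap.begin(), remap.end(), [&](int a, int b) {
        return tile_of_block[a] < tile_of_block[b];
    });
    return remap;
}
```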

A coldness metric for cache optimization
Raj Parihar, C. Ding, Michael C. Huang
DOI: 10.1145/2492408.2492419

A "hot" concept in program optimization is hotness: program optimization targets hot paths, and register allocation targets hot variables. Cache optimization, however, has to target cold data, which are less frequently used and tend to cause cache misses whenever they are accessed. Hot data, in contrast, being small and frequently used, tend to stay in cache. In this paper, we define a new metric called "coldness" and show how coldness varies across programs and how much colder the data we have to optimize becomes as the cache size on modern machines increases.
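
The abstract does not give the formal definition, so the following is only one way to make the idea concrete: call the C most frequently accessed objects "hot" for a cache holding C objects, and measure coldness as the share of accesses that fall outside that hot set. Under this reading, growing the cache leaves only colder and colder data for the optimizer to target.

```cpp
// Illustrative computation (the paper's formal definition may differ): treat
// the cache_capacity most frequently accessed objects as "hot", and report
// the fraction of accesses that go to the remaining cold data. As the cache
// grows, the data left to optimize is accessed ever more rarely.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

double coldness(std::vector<long> access_counts, std::size_t cache_capacity) {
    std::sort(access_counts.begin(), access_counts.end(), std::greater<>());
    long total = 0, hot = 0;
    for (std::size_t i = 0; i < access_counts.size(); ++i) {
        total += access_counts[i];
        if (i < cache_capacity) hot += access_counts[i];
    }
    return total == 0 ? 0.0 : 1.0 - double(hot) / double(total);
}

int main() {
    std::vector<long> counts = {1000, 500, 200, 50, 10, 5, 1, 1};  // per object
    for (std::size_t c : {2, 4, 6})
        std::printf("cache holds %zu objects -> coldness %.3f\n",
                    c, coldness(counts, c));
    return 0;
}
```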

Cache rationing for multicore
Jacob Brock, C. Ding
DOI: 10.1145/2492408.2492422

As the number of transistors on a chip increases, they are used mainly in two ways on multicore processors: first, to increase the number of cores, and second, to increase the size of cache memory. The two approaches intersect at a basic problem, which is how parallel tasks can best share the cache memory. The degree of sharing determines the available cache resource for each core and hence the memory performance and scalability of the system. In this paper, cache rationing is presented as a cache sharing solution for collaborative caching.

All-window data liveness
Pengcheng Li, C. Ding
DOI: 10.1145/2492408.2492420

This paper proposes a new metric called all-window liveness, which is the average amount of live data in all time windows of a given length. The paper gives a linear-time algorithm to compute the average liveness for all window lengths and discusses potential uses of the new metric.
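
The definition can be evaluated directly for a single window length, which is what the sketch below does; the paper's contribution is a linear-time algorithm covering all lengths at once, while this quadratic baseline is only meant to pin down the metric. An object allocated at time a and freed at time f contributes its size to every length-w window it overlaps.

```cpp
// Direct baseline for one window length (the paper's algorithm is linear-time
// and covers all lengths at once): an object allocated at alloc_time and
// freed at free_time is counted in window [t, t+w) whenever the two intervals
// overlap, and the metric averages the counted sizes over all such windows.
#include <cstdio>
#include <vector>

struct Object { long alloc_time, free_time, size; };

double average_liveness(const std::vector<Object>& objs,
                        long trace_length, long window) {
    double sum = 0.0;
    long windows = trace_length - window + 1;
    for (long t = 0; t < windows; ++t) {
        long live = 0;
        for (const Object& o : objs)
            if (o.alloc_time < t + window && o.free_time > t) live += o.size;
        sum += live;
    }
    return windows > 0 ? sum / windows : 0.0;
}

int main() {
    std::vector<Object> objs = {{0, 4, 100}, {2, 9, 50}, {6, 8, 200}};
    std::printf("avg liveness (w=3): %.1f bytes\n",
                average_liveness(objs, 10, 3));
    return 0;
}
```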