
Workshop on Memory System Performance and Correctness: Latest Publications

A low overhead method for recovering unused memory inside regions
Pub Date : 2013-06-16 DOI: 10.1145/2492408.2492415
Matthew Davis, P. Schachte, Z. Somogyi, H. Søndergaard
Automating memory management improves both resource safety and programmer productivity. One approach, region-based memory management (RBMM) [9], applies compile-time reasoning to identify points in a program at which memory can be safely reclaimed. The main advantage of RBMM over traditional garbage collection (GC) is the avoidance of expensive runtime analysis, which makes reclaiming memory much faster. On the other hand, GC requires no static analysis, and, operating at runtime, can have significantly more accurate information about object lifetimes. In this paper we propose a hybrid system that seeks to combine the advantages of both methods while avoiding the overheads that previous hybrid systems incurred. Our system can also reclaim array segments whose elements are no longer reachable.
Citations: 7
Program-centric cost models for locality
Pub Date : 2013-06-16 DOI: 10.1145/2492408.2492417
G. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, H. Simhadri
In this position paper, we argue that cost models for locality in parallel machines should be program-centric, not machine-centric.
Citations: 4
Can seqlocks get along with programming language memory models?
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247688
H. Boehm
Seqlocks are an important synchronization mechanism and represent a significant improvement over conventional reader-writer locks in some contexts. They avoid the need to update a synchronization variable during a reader critical section, and hence improve performance by avoiding cache coherence misses on the lock object itself. Unfortunately, they rely on speculative racing loads inside the critical section. This makes them an interesting problem case for programming-language-level memory models that emphasize data-race-free programming. We analyze a variety of implementation alternatives within the C++11 memory model, and briefly address the corresponding issue in Java. In the process, we observe that there may be a use for "read-dont-modify-write" operations, i.e., read-modify-write operations that atomically write back the original value, without modifying it, solely for the memory model consequences, and that it may be useful for compilers to optimize such operations.
Citations: 49
Rank idle time prediction driven last-level cache writeback
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247690
Zhe Wang, S. Khan, Daniel A. Jiménez
In modern DDRx memory systems, memory write requests can cause significant performance loss by increasing the memory access latency for subsequent read requests targeting the same device. In this paper, we propose a rank idle time prediction driven last-level cache writeback technique. This technique uses a rank idle time predictor to predict long phases of idle rank cycles. The scheduled dirty cache blocks generated from the last-level cache are written back during the predicted long idle rank period. This technique allows servicing write requests at points that minimize the delay they cause to subsequent read requests. Write-induced interference can be significantly reduced by using our technique. We evaluate our technique using a cycle-accurate full-system simulator and the SPEC CPU2006 benchmarks. The results show that the technique improves performance in an eight-core system with memory-intensive workloads by an average of 10.5% and 10.1% over conventional writeback using two-rank and four-rank DRAM configurations respectively.
Citations: 6
Towards region-based memory management for Go
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247695
Matthew Davis, P. Schachte, Z. Somogyi, H. Søndergaard
Region-based memory management aims to lower the cost of deallocation through bulk processing: instead of recovering the memory of each object separately, it recovers the memory of a region containing many objects. It relies on static analysis to determine the set of memory regions needed by a program, the program points at which each region should be created and removed, and, for each memory allocation, the region that should supply the memory. The concurrent language Go has features that pose interesting challenges for this analysis. We present a novel design for region-based memory management for Go, combining static analysis, to guide region creation, and lightweight runtime bookkeeping, to help control reclamation. The main advantage of our approach is that it greatly limits the amount of re-work that must be done after each change to the program source code, making our approach more practical than existing RBMM systems. Our prototype implementation covers most of the sequential fragment of Go, and preliminary results are encouraging.
Citations: 9
A higher order theory of locality
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247697
C. Ding, Xiaoya Xiang
This short paper outlines a theory for deriving the traditional metrics of miss rate and reuse distance from a single measure called the footprint. It gives the correctness condition and discusses the uses of the new theory in on-line locality analysis and multicore cache management.
Citations: 7
Parallel memory defragmentation on a GPU
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247693
R. Veldema, M. Philippsen
High-throughput memory management techniques such as malloc/free or mark-and-sweep collectors often exhibit memory fragmentation leaving allocated objects interspersed with free memory holes. Memory defragmentation removes such holes by moving objects around in memory so that they become adjacent (compaction) and holes can be merged (coalesced) to form larger holes. However, known defragmentation techniques are slow. This paper presents a parallel solution to best-effort partial defragmentation that makes use of all available cores. The solution not only speeds up defragmentation times significantly, but it also scales for many simple cores. It can therefore even be implemented on a GPU. One problem with compaction is that it requires all references to moved objects to be retargeted to point to their new locations. This paper further improves existing work by a better identification of the parts of the heap that contain references to objects moved by the compactor and only processes these parts to find the references that are then retargeted in parallel. To demonstrate the performance of the new memory defragmentation algorithm on many-core processors, we show its performance on a modern GPU. Parallelization speeds up compaction 40 times and coalescing up to 32 times. After compaction, our algorithm only needs to process 2%--4% of the total heap to retarget references.
Citations: 6
Analysis of pure methods using garbage collection
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247694
Erik Österlund, Welf Löwe
Parallelization and other optimizations often depend on static dependence analysis. This approach requires methods to be independent regardless of the input data, which is not always the case. Our contribution is a dynamic analysis that "guesses" whether methods are pure, i.e., whether they do not change state. The analysis piggybacks on a garbage collector, more specifically a concurrent, replicating garbage collector. It guesses whether objects are immutable by looking at actual mutations observed by the garbage collector. The analysis is essentially free. In fact, our concurrent garbage collector including the analysis outperforms Boehm's stop-the-world collector (without any analysis), as we show in experiments. Moreover, false guesses can be rolled back efficiently. The results can be used for just-in-time parallelization, allowing automatic parallelization of methods that are pure over certain periods of time. Hence, compared to parallelization based on static dependence analysis, more programs potentially benefit from parallelization.
Citations: 2
Can parallel data structures rely on automatic memory managers?
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247685
E. Petrank
The complexity of parallel data structures is often measured by two major factors: the throughput they provide and the progress they guarantee. Progress guarantees are particularly important for systems that require responsiveness such as real-time systems, operating systems, interactive systems, etc. Notions of progress guarantees such as lock-freedom, wait-freedom, and obstruction-freedom that provide different levels of guarantees have been proposed in the literature [4, 6]. Concurrent access (and furthermore, optimistic access) to shared objects makes the management of memory one of the more complex aspects of concurrent algorithms design. The use of automatic memory management greatly simplifies such algorithms [11, 3, 2, 9]. However, while the existence of lock-free garbage collection has been demonstrated [5], the existence of a practical automatic memory manager that supports lock-free or wait-free algorithms is still open. Furthermore, known schemes for manual reclamation of unused objects are difficult to use and impose a significant overhead on the execution [10]. It turns out that the memory management community is not fully aware of how dire the need is for memory managers that support progress guarantees for the design of concurrent data structures. Likewise, designers of concurrent data structures are not always aware of the fact that memory management with support for progress guarantees is not available. Closing this gap between these two communities is a major open problem for both communities. In this talk we will examine the memory management needs of concurrent algorithms. Next, we will discuss how state-of-the-art research and practice deal with the fact that an important piece of technology is missing (e.g., [7, 1]). Finally, we will survey the currently available pieces in this puzzle (e.g., [13, 12, 8]) and specify which pieces are missing. This open problem is arguably the greatest challenge facing the memory management community today.
Citations: 8
Trace-driven simulation of memory system scheduling in multithread application
Pub Date : 2012-06-16 DOI: 10.1145/2247684.2247691
Peng Fei Zhu, Mingyu Chen, Yungang Bao, Licheng Chen, Yongbing Huang
As commercial chip-multiprocessors (CMPs) integrate more and more cores, memory systems are playing an increasingly important role in multithread applications. Currently, trace-driven simulation is widely adopted in memory system scheduling research, since it is faster than execution-driven simulation and does not require data computation. For the same reason, however, its trace replay of concurrent thread execution lacks data information and contains only addresses, so misplacement occurs in simulations when the trace of one thread runs ahead of or behind the others. This kind of distortion can cause significant errors in research results. As shown in our experiments, trace misplacement causes an error rate of up to 10.22% in metrics including weighted IPC speedup, harmonic mean of IPC, and CPI throughput. This paper presents a methodology for avoiding trace misplacement in trace-driven simulation and ensuring the accuracy of memory scheduling simulation in multithread applications, thus providing a reliable means to study inter-thread interactions in memory systems.
Citations: 2