Parallel memory defragmentation on a GPU

R. Veldema, M. Philippsen
{"title":"Parallel memory defragmentation on a GPU","authors":"R. Veldema, M. Philippsen","doi":"10.1145/2247684.2247693","DOIUrl":null,"url":null,"abstract":"High-throughput memory management techniques such as malloc/free or mark-and-sweep collectors often exhibit memory fragmentation leaving allocated objects interspersed with free memory holes. Memory defragmentation removes such holes by moving objects around in memory so that they become adjacent (compaction) and holes can be merged (coalesced) to form larger holes. However, known defragmentation techniques are slow. This paper presents a parallel solution to best-effort partial defragmentation that makes use of all available cores. The solution not only speeds up defragmentation times significantly, but it also scales for many simple cores. It can therefore even be implemented on a GPU.\n One problem with compaction is that it requires all references to moved objects to be retargeted to point to their new locations. This paper further improves existing work by a better identification of the parts of the heap that contain references to objects moved by the compactor and only processes these parts to find the references that are then retargeted in parallel.\n To demonstrate the performance of the new memory defragmentation algorithm on many-core processors, we show its performance on a modern GPU. Parallelization speeds up compaction 40 times and coalescing up to 32 times. After compaction, our algorithm only needs to process 2%--4% of the total heap to retarget references.","PeriodicalId":130040,"journal":{"name":"Workshop on Memory System Performance and Correctness","volume":"447 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Memory System Performance and Correctness","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2247684.2247693","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

High-throughput memory management techniques such as malloc/free or mark-and-sweep collectors often exhibit memory fragmentation, leaving allocated objects interspersed with free memory holes. Memory defragmentation removes such holes by moving objects around in memory so that they become adjacent (compaction) and holes can be merged (coalesced) to form larger holes. However, known defragmentation techniques are slow. This paper presents a parallel solution to best-effort partial defragmentation that makes use of all available cores. The solution not only shortens defragmentation times significantly, but also scales to many simple cores. It can therefore even be implemented on a GPU. One problem with compaction is that it requires all references to moved objects to be retargeted to point to their new locations. This paper further improves on existing work by better identifying the parts of the heap that contain references to objects moved by the compactor; only these parts are processed to find the references, which are then retargeted in parallel. To demonstrate the new memory defragmentation algorithm on many-core processors, we evaluate its performance on a modern GPU. Parallelization speeds up compaction 40-fold and coalescing up to 32-fold. After compaction, our algorithm only needs to process 2%–4% of the total heap to retarget references.
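To make the two phases described above concrete, here is a minimal CUDA sketch: one kernel copies live objects to their compacted positions in parallel (one thread per object), and a second kernel retargets, in parallel, only those heap slots previously identified as holding references to moved objects. This is not the authors' implementation: it compacts into a separate to-space (the paper works on the heap in place), the new offsets are assumed to come from a prefix sum over live-object sizes computed beforehand, and all names (ObjDesc, compact_kernel, retarget_kernel) are illustrative.

```cuda
// Illustrative sketch only, not the paper's algorithm: compaction into a
// separate to-space plus retargeting of pre-identified reference slots.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// One descriptor per live object. new_off is assumed to be precomputed,
// e.g. as an exclusive prefix sum over the live object sizes.
struct ObjDesc {
    uint32_t old_off;   // word offset of the object in the fragmented heap
    uint32_t size;      // object size in 32-bit words
    uint32_t new_off;   // word offset after compaction
};

// Each thread copies one live object to its compacted position. Copying into
// a separate to-space sidesteps the overlap hazards that an in-place,
// sliding compaction has to handle.
__global__ void compact_kernel(const uint32_t* from, uint32_t* to,
                               const ObjDesc* objs, int n_objs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_objs) return;
    ObjDesc d = objs[i];
    for (uint32_t w = 0; w < d.size; ++w)   // word-by-word copy for clarity
        to[d.new_off + w] = from[d.old_off + w];
}

// Each thread rewrites one heap slot that was previously marked as holding a
// reference (stored as an old word offset) to a moved object, looking up the
// new offset by binary search over descriptors sorted by old_off. Scanning
// only the marked slots mirrors the paper's point that just a small fraction
// of the heap has to be touched after compaction.
__global__ void retarget_kernel(uint32_t* heap, const uint32_t* slot_idx,
                                int n_slots, const ObjDesc* objs, int n_objs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_slots) return;
    uint32_t old_ref = heap[slot_idx[i]];
    int lo = 0, hi = n_objs - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (objs[mid].old_off == old_ref) {
            heap[slot_idx[i]] = objs[mid].new_off;
            return;
        }
        if (objs[mid].old_off < old_ref) lo = mid + 1; else hi = mid - 1;
    }
}

int main() {
    // Tiny fragmented heap: object A in words [0,2), a hole in [2,4),
    // object B in words [4,6). A's second word holds a reference to B (old
    // offset 4). Slot 1 of the compacted heap is the marked reference slot.
    uint32_t h_heap[8]  = {11, 4, 0, 0, 22, 23, 0, 0};
    ObjDesc  h_objs[2]  = {{0, 2, 0}, {4, 2, 2}};  // new_off = prefix sum of sizes
    uint32_t h_slots[1] = {1};

    uint32_t *d_from, *d_to, *d_slots;  ObjDesc *d_objs;
    cudaMalloc(&d_from, sizeof(h_heap));   cudaMalloc(&d_to, sizeof(h_heap));
    cudaMalloc(&d_objs, sizeof(h_objs));   cudaMalloc(&d_slots, sizeof(h_slots));
    cudaMemcpy(d_from, h_heap, sizeof(h_heap), cudaMemcpyHostToDevice);
    cudaMemcpy(d_objs, h_objs, sizeof(h_objs), cudaMemcpyHostToDevice);
    cudaMemcpy(d_slots, h_slots, sizeof(h_slots), cudaMemcpyHostToDevice);

    compact_kernel<<<1, 32>>>(d_from, d_to, d_objs, 2);
    retarget_kernel<<<1, 32>>>(d_to, d_slots, 1, d_objs, 2);
    cudaMemcpy(h_heap, d_to, sizeof(h_heap), cudaMemcpyDeviceToHost);

    // Expect "11 2 22 23": B now starts at word 2 and A's reference follows it.
    printf("compacted heap: %u %u %u %u\n",
           h_heap[0], h_heap[1], h_heap[2], h_heap[3]);
    cudaFree(d_from); cudaFree(d_to); cudaFree(d_objs); cudaFree(d_slots);
    return 0;
}
```

Restricting the retargeting pass to a precomputed list of reference-holding slots is what lets the algorithm touch only a small fraction of the heap after compaction (2%–4% in the paper's measurements).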