
Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management: latest publications

OMR: out-of-core MapReduce for large data sets
Gurneet Kaur, Keval Vora, S. C. Koduru, Rajiv Gupta
While single-machine MapReduce systems can squeeze out maximum performance from available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently experiences out-of-memory crashes. Even though today's computers are equipped with efficient secondary storage devices, the frameworks do not utilize these devices, mainly because disk access latencies are much higher than those for main memory. Therefore, the single-machine setup of the Hadoop system performs much more slowly when presented with datasets larger than main memory. Moreover, such frameworks also require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an out-of-core MapReduce system that not only handles datasets far larger than main memory but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data read from and written to disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that, in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. In contrast to Metis, OMR avoids out-of-memory crashes for large datasets and delivers higher performance even when datasets are small enough to fit in main memory.
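To make "on-the-fly aggregation" concrete, here is a minimal map-side combiner sketch in C++ that bounds memory by spilling sorted runs with sequential writes. This is an illustration of the general technique, not OMR's code; the spill threshold and file naming are assumptions.

```cpp
#include <cstdint>
#include <fstream>
#include <map>
#include <string>

// Illustrative map-side combiner: aggregate values in memory and spill
// sorted runs to disk sequentially once a memory budget is exceeded.
// (Threshold and file naming are assumptions, not OMR's implementation.)
class SpillingAggregator {
public:
    explicit SpillingAggregator(std::size_t max_entries)
        : max_entries_(max_entries) {}

    void emit(const std::string& key, std::int64_t value) {
        counts_[key] += value;          // on-the-fly aggregation: one entry per key
        if (counts_.size() >= max_entries_)
            spill();                    // block-sequential write, bounded memory
    }

    void spill() {
        std::ofstream run("run_" + std::to_string(run_id_++) + ".tmp");
        for (const auto& [key, value] : counts_)   // std::map keeps keys sorted,
            run << key << '\t' << value << '\n';   // so runs can be merge-joined
        counts_.clear();                           // release memory before continuing
    }

private:
    std::map<std::string, std::int64_t> counts_;
    std::size_t max_entries_;
    int run_id_ = 0;
};
```

Sorted runs can later be merged with sequential reads, so the reduce side never needs more than a bounded number of buffers in memory at once.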
{"title":"OMR: out-of-core MapReduce for large data sets","authors":"Gurneet Kaur, Keval Vora, S. C. Koduru, Rajiv Gupta","doi":"10.1145/3210563.3210568","DOIUrl":"https://doi.org/10.1145/3210563.3210568","url":null,"abstract":"While single machine MapReduce systems can squeeze out maximum performance from available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently experiences out-of-memory crashes. Even though today's computers are equipped with efficient secondary storage devices, the frameworks do not utilize these devices mainly because disk access latencies are much higher than those for main memory. Therefore, the single-machine setup of the Hadoop system performs much slower when it is presented with the datasets which are larger than the main memory. Moreover, such frameworks also require tuning a lot of parameters which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only successfully handles datasets that are far larger than the size of main memory, it also guarantees linear scaling with the growing data sizes. OMR actively minimizes the amount of data to be read/written to/from disk via on-the-fly aggregation and it uses block sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. Also in contrast to Metis, OMR avoids out-of-memory crashes for large datasets as well as delivers higher performance when datasets are small enough to fit in main memory.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124827743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 6
Hardware-software co-optimization of memory management in dynamic languages
Mohamed Ismail, G. Suh
Dynamic programming languages are becoming increasingly popular, yet often show a significant performance slowdown compared to static languages. In this paper, we study the performance overhead of automatic memory management in dynamic languages. We propose to improve the performance and memory bandwidth usage of dynamic languages by co-optimizing garbage collection overhead and cache performance for newly-initialized and dead objects. Our study shows that less frequent garbage collection results in a large number of cache misses for initial stores to new objects. We solve this problem by directly placing uninitialized objects into on-chip caches without off-chip memory accesses. We further optimize the garbage collection by reducing unnecessary cache pollution and write-backs through partial tracing that invalidates dead objects between full garbage collections. Experimental results on PyPy and V8 show that less frequent garbage collection along with our optimizations can significantly improve the performance of dynamic languages.
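A minimal sketch of the effect being measured: with deferred reclamation, initializing stores land on cold cache lines. This benchmark shape is ours and purely illustrative of the problem; the paper's fix is in hardware, installing uninitialized lines directly in the cache.

```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

// Illustrates the cost the paper targets: when reclamation is deferred (as
// with infrequent GC), each allocation returns cold memory, so the *first*
// store to every cache line misses even though the old contents are dead.
// (Sizes and iteration counts are arbitrary.)
int main() {
    constexpr std::size_t kObjectSize = 4096;   // 64 cache lines per object
    constexpr int kObjects = 100000;
    std::vector<char*> live;
    live.reserve(kObjects);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kObjects; ++i) {
        char* obj = static_cast<char*>(std::malloc(kObjectSize));
        for (std::size_t b = 0; b < kObjectSize; b += 64)
            obj[b] = 1;                 // initializing store to a cold line
        live.push_back(obj);            // defer reclamation, as a lazy GC would
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(
                     end - start).count() << " ms\n";
    for (char* p : live) std::free(p);
}
```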
{"title":"Hardware-software co-optimization of memory management in dynamic languages","authors":"Mohamed Ismail, G. Suh","doi":"10.1145/3210563.3210566","DOIUrl":"https://doi.org/10.1145/3210563.3210566","url":null,"abstract":"Dynamic programming languages are becoming increasingly popular, yet often show a significant performance slowdown compared to static languages. In this paper, we study the performance overhead of automatic memory management in dynamic languages. We propose to improve the performance and memory bandwidth usage of dynamic languages by co-optimizing garbage collection overhead and cache performance for newly-initialized and dead objects. Our study shows that less frequent garbage collection results in a large number of cache misses for initial stores to new objects. We solve this problem by directly placing uninitialized objects into on-chip caches without off-chip memory accesses. We further optimize the garbage collection by reducing unnecessary cache pollution and write-backs through partial tracing that invalidates dead objects between full garbage collections. Experimental results on PyPy and V8 show that less frequent garbage collection along with our optimizations can significantly improve the performance of dynamic languages.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124970999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 2
Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management
Christine H. Flood, Zheng Zhang
{"title":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","authors":"Christine H. Flood, Zheng Zhang","doi":"10.1145/3210563","DOIUrl":"https://doi.org/10.1145/3210563","url":null,"abstract":"","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126182909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 0
Prediction and bounds on shared cache demand from memory access interleaving
Jacob Brock, C. Ding, Rahman Lavaee, Fangzhou Liu, Liang Yuan
Cache in multicore machines is often shared, and the cache performance depends on how memory accesses belonging to different programs interleave with one another. The full range of performance possibilities includes all possible interleavings, which are too numerous to be studied by experiments for any mix of non-trivial programs. This paper presents a theory to characterize the effect of memory access interleaving due to parallel execution of non-data-sharing programs. The theory uses an established metric called the footprint (which can be used to calculate miss ratios in fully-associative LRU caches) to measure cache demand, and considers the full range of interleaving possibilities. The paper proves a lower bound for footprints of interleaved traces, and then formulates an upper bound in terms of the footprints of the constituent traces. It also shows the correctness of footprint composition used in a number of existing techniques, and places precise bounds on its accuracy.
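For background, the footprint metric referenced here has a standard definition from the higher-order theory of locality; the formulas below restate that established background as we understand it, and are not the paper's new interleaving bounds.

```latex
% Footprint: the average number of distinct data blocks accessed in a
% time window of length w, over all length-w windows of a trace of length n.
fp(w) = \frac{1}{n - w + 1} \sum_{i=1}^{n-w+1}
        \bigl|\{\, d : d \text{ is accessed in } [i,\, i+w-1] \,\}\bigr|

% Conversion to the miss ratio of a fully associative LRU cache of size c:
% the miss ratio is the marginal growth of the footprint at the window
% length w whose average footprint just fills the cache.
mr(c) = fp(w+1) - fp(w), \qquad \text{where } c = fp(w)
```

The paper's contribution, per the abstract, is to bound the footprint of an interleaved trace from below and above in terms of the footprints of the constituent traces.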
{"title":"Prediction and bounds on shared cache demand from memory access interleaving","authors":"Jacob Brock, C. Ding, Rahman Lavaee, Fangzhou Liu, Liang Yuan","doi":"10.1145/3210563.3210565","DOIUrl":"https://doi.org/10.1145/3210563.3210565","url":null,"abstract":"Cache in multicore machines is often shared, and the cache performance depends on how memory accesses belonging to different programs interleave with one another. The full range of performance possibilities includes all possible interleavings, which are too numerous to be studied by experiments for any mix of non-trivial programs. This paper presents a theory to characterize the effect of memory access interleaving due to parallel execution of non-data-sharing programs. The theory uses an established metric called the footprint (which can be used to calculate miss ratios in fully-associative LRU caches) to measure cache demand, and considers the full range of interleaving possibilities. The paper proves a lower bound for footprints of interleaved traces, and then formulates an upper bound in terms of the footprints of the constituent traces. It also shows the correctness of footprint composition used in a number of existing techniques, and places precise bounds on its accuracy.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130705995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
Dynamic vertical memory scalability for OpenJDK cloud applications
R. Bruno, P. Ferreira, R. Synytsky, Tetiana Fydorenchyk, J. Rao, Hang Huang, Song Wu
The cloud is an increasingly popular platform to deploy applications, as it lets cloud users provision resources for their applications as needed. Furthermore, cloud providers are now starting to offer a "pay-as-you-use" model in which users are only charged for the resources that are actually used, instead of paying for a statically sized instance. This new model allows cloud users to save money and cloud providers to better utilize their hardware. However, applications running on top of runtime environments such as the Java Virtual Machine (JVM) cannot benefit from this new model because they cannot dynamically adapt the amount of resources they use at runtime. In particular, if an application needs more memory than was initially predicted at launch time, the JVM will not allow the application to grow its memory beyond the maximum value defined at launch time. In addition, the JVM will hold on to memory that is no longer being used by the application. This lack of dynamic vertical scalability completely negates the benefits of the "pay-as-you-use" model, forcing users to over-provision resources and lose money on unused resources. We propose a new JVM heap sizing strategy that allows the JVM to dynamically scale its memory utilization according to the application's needs. First, we provide a configurable limit on how much the application can grow its memory. This limit is dynamic and can be changed at runtime, as opposed to the current static limit that can only be set at launch time. Second, we adapt current garbage collection policies that control how much the heap can grow and shrink to better fit what the application is currently using. The proposed solution is implemented in the OpenJDK 9 HotSpot JVM, the new release of OpenJDK. Changes were also introduced inside the Parallel Scavenge collector and the Garbage First collector (the new default collector in HotSpot). Evaluation experiments using real workloads and data show that, with negligible throughput and memory overhead, dynamic vertical memory scalability can be achieved. This allows users to save significant amounts of money by not paying for unused resources, and cloud providers to better utilize their physical machines.
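As a rough illustration of the kind of post-GC sizing decision described (HotSpot's collectors are written in C++), the sketch below grows or shrinks the committed heap toward current use under a limit that can change at runtime. The class, ratios, and names are our assumptions, not OpenJDK code.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

// Sketch of a post-GC heap sizing policy with a *dynamic* maximum: the
// limit is an atomic that management code may change at runtime, unlike
// the launch-time-only -Xmx limit the paper starts from. Illustrative only.
class HeapSizer {
public:
    explicit HeapSizer(std::size_t initial_limit) : dynamic_max_(initial_limit) {}

    // Called by an external controller, e.g. when the cloud tenant buys
    // (or releases) memory; takes effect at the next GC cycle.
    void set_limit(std::size_t bytes) { dynamic_max_.store(bytes); }

    // Called after each collection with the number of live bytes; returns
    // the committed size the heap should shrink or grow toward.
    std::size_t target_committed(std::size_t used_bytes) const {
        std::size_t grown = used_bytes + used_bytes / 4;          // ~25% headroom
        std::size_t floor = used_bytes + (std::size_t{8} << 20);  // >= 8 MiB slack
        return std::min(std::max(grown, floor), dynamic_max_.load());
    }
};
```

Pages above the target can then be uncommitted and returned to the operating system, which is what lets a "pay-as-you-use" bill actually shrink.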
{"title":"Dynamic vertical memory scalability for OpenJDK cloud applications","authors":"R. Bruno, P. Ferreira, R. Synytsky, Tetiana Fydorenchyk, J. Rao, Hang Huang, Song Wu","doi":"10.1145/3210563.3210567","DOIUrl":"https://doi.org/10.1145/3210563.3210567","url":null,"abstract":"The cloud is an increasingly popular platform to deploy applications as it lets cloud users to provide resources to their applications as needed. Furthermore, cloud providers are now starting to offer a \"pay-as-you-use\" model in which users are only charged for the resources that are really used instead of paying for a statically sized instance. This new model allows cloud users to save money, and cloud providers to better utilize their hardware. However, applications running on top of runtime environments such as the Java Virtual Machine (JVM) cannot benefit from this new model because they cannot dynamically adapt the amount of used resources at runtime. In particular, if an application needs more memory than what was initially predicted at launch time, the JVM will not allow the application to grow its memory beyond the maximum value defined at launch time. In addition, the JVM will hold memory that is no longer being used by the application. This lack of dynamic vertical scalability completely prevents the benefits of the \"pay-as-you-use\" model, and forces users to over-provision resources, and to lose money on unused resources. We propose a new JVM heap sizing strategy that allows the JVM to dynamically scale its memory utilization according to the application's needs. First, we provide a configurable limit on how much the application can grow its memory. This limit is dynamic and can be changed at runtime, as opposed to the current static limit that can only be set at launch time. Second, we adapt current Garbage Collection policies that control how much the heap can grow and shrink to better fit what is currently being used by the application. The proposed solution is implemented in the OpenJDK 9 HotSpot JVM, the new release of OpenJDK. Changes were also introduced inside the Parallel Scavenge collector and the Garbage First collector (the new by-default collector in HotSpot). Evaluation experiments using real workloads and data show that, with negligible throughput and memory overhead, dynamic vertical memory scalability can be achieved. This allows users to save significant amounts of money by not paying for unused resources, and cloud providers to better utilize their physical machines.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132983726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 9
Distributed garbage collection for general graphs
Steven R. Brandt, H. Krishnan, C. Busch, Gokarna Sharma
We propose a scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O(log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O(E_a) time, where E_a is the number of edges in the induced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.
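For orientation, Brownbridge-style collectors label edges so that no cycle consists solely of "strong" references. The sketch below shows only that bookkeeping idea, with placeholder edge-type names; the paper's actual four reference types, message protocol, and termination detection are not reproduced here.

```cpp
#include <array>
#include <cstdint>

// Minimal bookkeeping sketch for an edge-labeled reference counter.
// The paper extends Brownbridge's strong/weak scheme to four edge
// types; the names below are placeholders, not the paper's labels.
enum class EdgeType : std::uint8_t { Strong, Weak, Phantom, Anti };

struct Node {
    // One counter per incoming edge type. Counters of O(log n) bits
    // suffice, since a node has at most n incoming edges; a fixed
    // 32-bit width is used here purely for simplicity.
    std::array<std::uint32_t, 4> in_count{};

    // Brownbridge invariant (two-type version): every cycle contains at
    // least one non-strong edge, so a node whose strong count reaches
    // zero is either garbage or kept alive only through a cycle, which
    // a partial trace starting from that node can then decide.
    bool maybe_garbage() const {
        return in_count[static_cast<int>(EdgeType::Strong)] == 0;
    }
};
```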
{"title":"Distributed garbage collection for general graphs","authors":"Steven R. Brandt, H. Krishnan, C. Busch, Gokarna Sharma","doi":"10.1145/3210563.3210572","DOIUrl":"https://doi.org/10.1145/3210563.3210572","url":null,"abstract":"We propose a scalable, cycle-collecting, decentralized, reference counting garbage collector with partial tracing. The algorithm is based on the Brownbridge system but uses four different types of references to label edges. Memory usage is O (log n) bits per node, where n is the number of nodes in the graph. The algorithm assumes an asynchronous network model with a reliable reordering channel. It collects garbage in O (E a ) time, where E a is the number of edges in the in- duced subgraph. The algorithm uses termination detection to manage the distributed computation, a unique identifier to break the symmetry among multiple collectors, and a transaction-based approach when multiple collectors conflict. Unlike existing algorithms, ours is not centralized, does not require barriers, does not require migration of nodes, does not require back-pointers on every edge, and is stable against concurrent mutation.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127566693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 1
mPart: miss-ratio curve guided partitioning in key-value stores
Daniel Byrne, Nilufer Onder, Zhenlin Wang
Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves a web application's performance by serving requests from memory, avoiding fetches from the backend database. Since memory space is limited, maximizing memory utilization is key to delivering the best performance possible. This has led to the use of multi-tenant systems, allowing applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation. In this work, we address both multi-tenancy (where a single cache is used for multiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near-optimal memory allocation given the available system memory, while adapting to changing data access patterns. We show that our model outperforms an existing state-of-the-art sharing model, Memshare, in terms of overall cache hit ratio, and does so at a lower time cost. We show that for a typical system, overall hit ratio is consistently 1 percentage point greater and 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests.
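One standard way to act on miss-ratio curves is marginal-utility allocation. The greedy sketch below illustrates MRC-guided partitioning in general and should not be read as mPart's exact optimization.

```cpp
#include <cstddef>
#include <vector>

// Greedy MRC-guided partitioning (illustrative, not mPart's optimizer):
// repeatedly give the next memory unit to the tenant whose weighted miss
// ratio drops the most for that unit.
//   mrc[t][u] = miss ratio of tenant t when given u memory units
//   rate[t]   = tenant t's request rate, so gains are weighted by traffic
std::vector<std::size_t> partition(const std::vector<std::vector<double>>& mrc,
                                   const std::vector<double>& rate,
                                   std::size_t total_units) {
    std::vector<std::size_t> alloc(mrc.size(), 0);
    for (std::size_t unit = 0; unit < total_units; ++unit) {
        std::size_t best = 0;
        double best_gain = -1.0;
        for (std::size_t t = 0; t < mrc.size(); ++t) {
            if (alloc[t] + 1 >= mrc[t].size()) continue;   // curve exhausted
            double gain = rate[t] * (mrc[t][alloc[t]] - mrc[t][alloc[t] + 1]);
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best_gain < 0.0) break;   // no tenant can still benefit
        ++alloc[best];                // award one unit to the highest marginal gain
    }
    return alloc;
}
```

The interesting systems work is elsewhere: building the curves online cheaply enough that the allocator can track shifting workloads, which is what the paper contributes.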
{"title":"mPart: miss-ratio curve guided partitioning in key-value stores","authors":"Daniel Byrne, Nilufer Onder, Zhenlin Wang","doi":"10.1145/3210563.3210571","DOIUrl":"https://doi.org/10.1145/3210563.3210571","url":null,"abstract":"Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves an web application's performance by serving its requests from memory, avoiding fetching them from the backend database. Since the memory space is limited, maximizing the memory utilization is a key to delivering the best performance possible. This has lead to the use of multi-tenant systems, allowing applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation. In this work, we address both multi-tenancy (where a single cache is used for multiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near optimal memory allocation given the available system memory, while adapting to changing data access patterns. We show that our model outperforms an existing state-of-the-art sharing model, Memshare, in terms of overall cache hit ratio and does so at a lower time cost. We show that for a typical system, overall hit ratio is consistently 1 percentage point greater and 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123697914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 24
FRC: a high-performance concurrent parallel deferred reference counter for C++
Charles Tripp, David Hyde, Benjamin E. Grossman‐Ponemon
We present FRC, a high-performance concurrent parallel reference counter for unmanaged languages. It is well known that high-performance garbage collectors help developers write memory-safe, highly concurrent systems and data structures. While C++, C, and other unmanaged languages are used in high-performance applications, adding concurrent memory management to these languages has proven to be difficult. Unmanaged languages like C++ use pointers instead of references, and have uncooperative mutators which do not pause easily at a safe point. Thus, scanning mutator stack root references is challenging. FRC only defers decrements and does not require mutator threads to pause during collection. By deferring only decrements, FRC avoids much of the synchronization overhead of a fully-deferred implementation. Root references are scanned without interrupting the mutator by publishing these references to a thread-local array. FRC's performance can exceed that of the C++ standard library's shared pointer by orders of magnitude. FRC's thread-safety guarantees and low synchronization overhead enable significant throughput gains for concurrently-readable shared data structures. We describe the components of FRC, including our static tree router data structure: a novel barrier which improves the scalability of parallel collection workers. FRC's performance is evaluated on several concurrent data structures. We release FRC and our tests as open-source code and expect FRC will be useful for many concurrent C++ software systems.
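The central trick (eager increments, with decrements buffered per thread and drained later by a collector) can be sketched as follows. This is a simplification that omits FRC's root publication, epoch management, and lock-free buffers.

```cpp
#include <atomic>
#include <vector>

// Simplified deferred reference counting in the style the paper describes;
// FRC's real buffers are lock-free and drained at well-defined epochs.
struct Counted {
    std::atomic<long> refs{1};
    virtual ~Counted() = default;
};

// Per-thread buffer of deferred decrements; the collector drains these
// buffers at a quiescent point and frees objects that reach zero.
thread_local std::vector<Counted*> deferred_decs;

inline void acquire(Counted* obj) {
    obj->refs.fetch_add(1, std::memory_order_relaxed);  // increments are eager
}

inline void release(Counted* obj) {
    deferred_decs.push_back(obj);   // no fence, no free: just record the drop
}

// Collector-side: apply one thread's buffered decrements.
inline void drain(std::vector<Counted*>& buf) {
    for (Counted* obj : buf)
        if (obj->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete obj;             // last reference gone: reclaim
    buf.clear();
}
```

Because a decrement can never revive an object, deferring only decrements keeps the fast path fence-free while preserving safety; that asymmetry is what the abstract credits for avoiding most synchronization overhead.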
{"title":"FRC: a high-performance concurrent parallel deferred reference counter for C++","authors":"Charles Tripp, David Hyde, Benjamin E. Grossman‐Ponemon","doi":"10.1145/3210563.3210569","DOIUrl":"https://doi.org/10.1145/3210563.3210569","url":null,"abstract":"We present FRC, a high-performance concurrent parallel reference counter for unmanaged languages. It is well known that high-performance garbage collectors help developers write memory-safe, highly concurrent systems and data structures. While C++, C, and other unmanaged languages are used in high-performance applications, adding concurrent memory management to these languages has proven to be difficult. Unmanaged languages like C++ use pointers instead of references, and have uncooperative mutators which do not pause easily at a safe point. Thus, scanning mutator stack root references is challenging. FRC only defers decrements and does not require mutator threads to pause during collection. By deferring only decrements, FRC avoids much of the synchronization overhead of a fully-deferred implementation. Root references are scanned without interrupting the mutator by publishing these references to a thread-local array. FRC's performance can exceed that of the C++ standard library's shared pointer by orders of magnitude. FRC's thread-safety guarantees and low synchronization overhead enable significant throughput gains for concurrently-readable shared data structures. We describe the components of FRC, including our static tree router data structure: a novel barrier which improves the scalability of parallel collection workers. FRC's performance is evaluated on several concurrent data structures. We release FRC and our tests as open-source code and expect FRC will be useful for many concurrent C++ software systems.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123626078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
Detailed heap profiling
Stuart Byma, J. Larus
Modern software systems heavily use the memory heap. As systems grow more complex and compute with increasing amounts of data, it can be difficult for developers to understand how their programs actually use the bytes that they allocate on the heap and whether improvements are possible. To answer this question of heap usage efficiency, we have built a new, detailed heap profiler called Memoro. Memoro uses a combination of static instrumentation, subroutine interception, and runtime data collection to build a clear picture of exactly when and where a program performs heap allocation, and crucially how it actually uses that memory. Memoro also introduces a new visualization application that can distill collected data into scores and visual cues that allow developers to quickly pinpoint and eliminate inefficient heap usage in their software. Our evaluation and experience with several applications demonstrates that Memoro can reduce heap usage and produce runtime improvements of 10%.
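To make the data-collection side concrete, the toy recorder below tallies allocation count and bytes per call site through a wrapper. This is a much simpler stand-in shown for illustration only; Memoro itself uses compiler instrumentation and also tracks how the allocated memory is subsequently accessed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_map>

// Toy allocation recorder (not Memoro's mechanism): wrap allocations and
// tally bytes per call site. A real profiler like Memoro additionally
// tracks reads and writes to each chunk to measure *usage*, which a
// simple wrapper cannot see.
struct SiteStats { std::size_t count = 0, bytes = 0; };

std::mutex g_mu;
std::unordered_map<const void*, SiteStats> g_sites;   // keyed by call site

void* traced_malloc(std::size_t size, const void* site) {
    {
        std::lock_guard<std::mutex> lock(g_mu);
        auto& s = g_sites[site];
        ++s.count;
        s.bytes += size;
    }
    return std::malloc(size);
}

// Usage: pass a site tag such as the caller's return address, e.g.
//   void* p = traced_malloc(64, __builtin_return_address(0));
```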
{"title":"Detailed heap profiling","authors":"Stuart Byma, J. Larus","doi":"10.1145/3210563.3210564","DOIUrl":"https://doi.org/10.1145/3210563.3210564","url":null,"abstract":"Modern software systems heavily use the memory heap. As systems grow more complex and compute with increasing amounts of data, it can be difficult for developers to understand how their programs actually use the bytes that they allocate on the heap and whether improvements are possible. To answer this question of heap usage efficiency, we have built a new, detailed heap profiler called Memoro. Memoro uses a combination of static instrumentation, subroutine interception, and runtime data collection to build a clear picture of exactly when and where a program performs heap allocation, and crucially how it actually uses that memory. Memoro also introduces a new visualization application that can distill collected data into scores and visual cues that allow developers to quickly pinpoint and eliminate inefficient heap usage in their software. Our evaluation and experience with several applications demonstrates that Memoro can reduce heap usage and produce runtime improvements of 10%.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124623678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 4
Balanced double queues for GC work-stealing on weak memory models
Michihiro Horie, H. Horii, Kazunori Ogata, Tamiya Onodera
Work-stealing is promising for scheduling and balancing parallel workloads. It has a wide range of applicability in middleware, libraries, and the runtime systems of programming languages. OpenJDK uses work-stealing in copying garbage collection (GC) to balance copying tasks among GC threads. Each thread has its own queue to store tasks. When a thread has no task in its queue, it acts as a thief and attempts to steal a task from another thread's queue. However, this work-stealing algorithm requires expensive memory fences for pushing, popping, and stealing tasks, especially on weak memory models such as POWER and ARM. To address this problem, we propose a work-stealing algorithm that uses double queues. Each GC thread has a public queue that is accessible from other GC threads and a private queue that is accessible only by itself. Pushing and popping tasks in the private queue is free from expensive memory fences. The key point of our algorithm is a mechanism that maintains load balance while using double queues. We developed a prototype implementation of parallel GC in OpenJDK8 for ppc64le. We evaluated our algorithm using SPECjbb2015, SPECjvm2008, TPC-DS, and Apache DayTrader.
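The shape of the double-queue design can be sketched as below. The real collector uses lock-free deques with carefully placed fences on POWER; this sketch guards the public queue with a mutex purely to keep the structure readable, and the overflow threshold is an assumption about how tasks might be shared.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

// Per-GC-thread task queues (illustrative sketch): a fence-free private
// queue for the common case, plus a shared public queue thieves may
// steal from. The overflow policy stands in for the paper's balancing
// mechanism, which is not reproduced here.
template <typename Task>
class DoubleQueue {
public:
    void push(Task t) {
        if (private_q_.size() < kPrivateCap) {
            private_q_.push_back(std::move(t));     // no atomics, no fences
        } else {
            std::lock_guard<std::mutex> lock(mu_);  // overflow goes public,
            public_q_.push_back(std::move(t));      // keeping thieves fed
        }
    }

    std::optional<Task> pop() {                     // owner: private first
        if (!private_q_.empty()) {
            Task t = std::move(private_q_.back());
            private_q_.pop_back();
            return t;
        }
        std::lock_guard<std::mutex> lock(mu_);
        if (public_q_.empty()) return std::nullopt;
        Task t = std::move(public_q_.back());
        public_q_.pop_back();
        return t;
    }

    std::optional<Task> steal() {                   // thief: public only
        std::lock_guard<std::mutex> lock(mu_);
        if (public_q_.empty()) return std::nullopt;
        Task t = std::move(public_q_.front());      // take the opposite end
        public_q_.pop_front();
        return t;
    }

private:
    static constexpr std::size_t kPrivateCap = 256; // assumed threshold
    std::deque<Task> private_q_;                    // owner-only
    std::deque<Task> public_q_;                     // shared with thieves
    std::mutex mu_;
};
```

The design trade-off is visible in the sketch: keeping everything private would eliminate fences but starve thieves, so some policy must keep enough work public, which is exactly the balancing mechanism the paper contributes.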
{"title":"Balanced double queues for GC work-stealing on weak memory models","authors":"Michihiro Horie, H. Horii, Kazunori Ogata, Tamiya Onodera","doi":"10.1145/3210563.3210570","DOIUrl":"https://doi.org/10.1145/3210563.3210570","url":null,"abstract":"Work-stealing is promising for scheduling and balancing parallel workloads. It has a wide range of applicability on middleware, libraries, and runtime systems of programming languages. OpenJDK uses work-stealing for copying garbage collection (GC) to balance copying tasks among GC threads. Each thread has its own queue to store tasks. When a thread has no task in its queue, it acts as a thief and attempts to steal a task from another thread's queue. However, this work-stealing algorithm requires expensive memory fences for pushing, popping, and stealing tasks, especially on weak memory models such as POWER and ARM. To address this problem, we propose a work-stealing algorithm that uses double queues. Each GC thread has a public queue that is accessible from other GC threads and a private queue that is only accessible by itself. Pushing and popping tasks in the private queue are free from expensive memory fences. The most significant point in our algorithm is providing a mechanism to maintain the load balance on the basis of the use of double queues. We developed a prototype implementation for parallel GC in OpenJDK8 for ppc64le. We evaluated our algorithm by using SPECjbb2015, SPECjvm2008, TPC-DS, and Apache DayTrader.","PeriodicalId":420262,"journal":{"name":"Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129009152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 3