
Proceedings of the 2015 International Symposium on Memory Systems: Latest Publications

Towards Workload-Aware Page Cache Replacement Policies for Hybrid Memories
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818978
Ahsen J. Uppal, Mitesh R. Meswani
Die-stacked DRAM is an emerging technology that is expected to be integrated in future systems with off-package memories, resulting in a hybrid memory system. A large body of recent research has investigated the use of die-stacked dynamic random-access memory (DRAM) as a hardware-managed last-level cache. This approach comes at the cost of managing large tag arrays, increased hit latencies, and potentially significant increases in hardware verification costs. An alternative approach is for the operating system (OS) to manage the die-stacked DRAM as a page cache for off-package memories. However, recent work on OS-managed page caches focuses on FIFO replacement and related variants as the baseline management policy. In this paper, we take a step back, investigate classical OS page replacement policies, and re-evaluate them for hybrid memories. We find that across different die-stacked DRAM sizes, the choice of best management policy depends on cache size and application, and can result in as much as a 13X performance difference. Furthermore, within a single application run, the choice of best policy varies over time. We also evaluate co-scheduled workload pairs and find that the best policy varies by workload pair and cache configuration, and that the best-performing policy is typically the most fair. These results motivate our continuing investigation into workload-aware and cache-configuration-aware page cache management policies.
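The policy comparison at the heart of this abstract is easy to reproduce in miniature. The sketch below is not the paper's simulator; it is a minimal C model with invented parameters (CACHE_PAGES, the hot-set size, and the 70/30 trace mix are all assumptions) that counts FIFO and LRU miss rates over a synthetic trace mixing hot-page reuse with a scan. Varying the cache size or the trace mix shows how the best policy can shift with configuration and workload, which is the paper's core observation.

```c
#include <stdio.h>
#include <stdlib.h>

#define CACHE_PAGES 64
#define TRACE_LEN   100000
#define FOOTPRINT   256            /* distinct pages in the scan component */
#define HOT_PAGES   32             /* hypothetical hot set */

static int fifo_q[CACHE_PAGES], fifo_head, fifo_n;
static int lru_q[CACHE_PAGES], lru_n;       /* lru_q[0] is most recent */

/* Both functions return 1 on miss, 0 on hit. */
static int fifo_access(int page) {
    for (int i = 0; i < fifo_n; i++)
        if (fifo_q[i] == page) return 0;
    if (fifo_n < CACHE_PAGES) {
        fifo_q[fifo_n++] = page;            /* cold fill */
    } else {
        fifo_q[fifo_head] = page;           /* evict oldest resident */
        fifo_head = (fifo_head + 1) % CACHE_PAGES;
    }
    return 1;
}

static int lru_access(int page) {
    int i;
    for (i = 0; i < lru_n; i++)
        if (lru_q[i] == page) break;
    int miss = (i == lru_n);
    if (miss && lru_n < CACHE_PAGES) lru_n++;   /* cold fill */
    if (i == lru_n) i = lru_n - 1;              /* full miss: drop LRU victim */
    for (; i > 0; i--) lru_q[i] = lru_q[i - 1]; /* shift toward the LRU end */
    lru_q[0] = page;                            /* promote to MRU */
    return miss;
}

int main(void) {
    srand(1);
    long fifo_miss = 0, lru_miss = 0;
    for (long t = 0; t < TRACE_LEN; t++) {
        /* 70% reuse of a small hot set, 30% sequential scan */
        int page = (rand() % 100 < 70) ? rand() % HOT_PAGES
                                       : (int)(t % FOOTPRINT);
        fifo_miss += fifo_access(page);
        lru_miss  += lru_access(page);
    }
    printf("FIFO miss rate: %5.2f%%\n", 100.0 * fifo_miss / TRACE_LEN);
    printf("LRU  miss rate: %5.2f%%\n", 100.0 * lru_miss / TRACE_LEN);
    return 0;
}
```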
Citations: 6
MEMST: Cloning Memory Behavior using Stochastic Traces
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818971
Ganesh Balakrishnan, Yan Solihin
Memory controller and DRAM architecture are critical aspects of chip multiprocessor (CMP) design. A good design needs an in-depth understanding of end-user workloads. However, designers rarely get insight into end-user workloads because of the proprietary nature of source code or data. Workload cloning is an emerging approach that can bridge this gap by creating a proxy (clone) for the proprietary workload. Cloning involves profiling workloads to glean key statistics and then generating a clone offline for use in the design environment. However, no existing cloning techniques accurately capture memory controller and DRAM behavior in a form designers can use for wide design-space exploration. We propose Memory EMulation using Stochastic Traces (MEMST), a highly accurate black-box cloning framework for capturing DRAM and memory controller behavior. We provide a detailed analysis of the statistics necessary to model a workload accurately and show how a clone can be generated from these statistics using a novel stochastic method. Finally, we validate our framework across a wide design space by varying DRAM organization, address mapping, DRAM frequency, page policy, scheduling policy, input bus bandwidth, chipset latency, DRAM die revision, DRAM generation, and DRAM refresh policy. We evaluated MEMST using the CPU2006, BioBench, Stream, and PARSEC benchmark suites across the design space for single-core, dual-core, quad-core, and octa-core CMPs, measuring both performance and power metrics for the original workloads and clones. The clones show a very high degree of correlation with the original workloads over 7900 data points, with average errors of 1.8% and 1.6% for transaction latency and DRAM power respectively.
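The profile-then-generate loop that cloning relies on can be illustrated with a stride histogram. The sketch below is an assumption-laden miniature, not MEMST itself: it bins the strides of a stand-in "proprietary" address stream into four illustrative buckets (sequential, repeat, page-sized, random jump) and then samples that distribution to emit a synthetic clone.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NBUCKETS 4
/* Illustrative stride buckets: +64 B (sequential), 0 (repeat),
 * +4 KB (page step), and -1 as a sentinel for a random jump. */
static const int64_t stride_of[NBUCKETS] = {64, 0, 4096, -1};
static long hist[NBUCKETS];

static void profile(const uint64_t *trace, int n) {
    for (int i = 1; i < n; i++) {
        int64_t s = (int64_t)(trace[i] - trace[i - 1]);
        if      (s == 64)   hist[0]++;
        else if (s == 0)    hist[1]++;
        else if (s == 4096) hist[2]++;
        else                hist[3]++;
    }
}

/* Sample the next clone address from the profiled stride distribution. */
static uint64_t next_clone_addr(uint64_t cur) {
    long total = hist[0] + hist[1] + hist[2] + hist[3];
    long r = rand() % total, acc = 0;
    for (int b = 0; b < NBUCKETS; b++) {
        acc += hist[b];
        if (r < acc)
            return stride_of[b] < 0 ? (uint64_t)rand() << 12   /* random jump */
                                    : cur + (uint64_t)stride_of[b];
    }
    return cur;
}

int main(void) {
    srand(2);
    uint64_t orig[1000];
    for (int i = 0; i < 1000; i++)   /* stand-in "proprietary" stream */
        orig[i] = (i % 10 == 0) ? (uint64_t)rand() << 12 : orig[i - 1] + 64;
    profile(orig, 1000);

    uint64_t a = 0;
    for (int i = 0; i < 8; i++) {    /* emit a short clone prefix */
        a = next_clone_addr(a);
        printf("clone addr %d: 0x%llx\n", i, (unsigned long long)a);
    }
    return 0;
}
```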
Citations: 7
Rethinking Design Metrics for Datacenter DRAM
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818973
M. Awasthi
Over the years, the evolution of DRAM has delivered little improvement in access latencies; instead, devices have been optimized to deliver ever greater peak bandwidths. The combined bandwidth in a contemporary multi-socket server system runs into hundreds of GB/s. However, datacenter-scale applications running on server platforms care largely about having access to a large pool of low-latency main memory (DRAM) and, even in the best case, are unable to utilize more than a small fraction of the total memory bandwidth. In this extended abstract, we use measured data from state-of-the-art servers running memory-intensive datacenter workloads like Memcached to argue that main memory design should steer away from optimizing traditional DRAM metrics like peak bandwidth, and should instead cater to the datacenter server industry's growing need for high-density, low-latency memory with moderate bandwidth requirements.
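A simple microbenchmark makes the latency-versus-bandwidth distinction concrete. The sketch below (assumed buffer size; not the paper's measurement setup) times a dependent pointer chase, which exposes access latency, against a sequential scan, which exposes achievable bandwidth; latency-bound code like the workloads described here gains little from the scan-side number.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)          /* 16M pointers, ~128 MB */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;
    /* Sattolo's shuffle: one big cycle, so every load depends on the last */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    double t0 = now_sec();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];     /* latency-bound */
    double chase = now_sec() - t0;

    t0 = now_sec();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];  /* bandwidth-bound */
    double scan = now_sec() - t0;

    printf("chase: %.1f ns/access   scan: %.2f GB/s   (p=%zu sum=%zu)\n",
           chase / (double)N * 1e9,
           (double)(N * sizeof(size_t)) / scan / 1e9, p, sum);
    free(next);
    return 0;
}
```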
Citations: 6
S-L1: A Software-based GPU L1 Cache that Outperforms the Hardware L1 for Data Processing Applications
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818969
Reza Mokhtari, M. Stumm
Implementing a GPU L1 data cache entirely in software to usurp the hardware L1 cache sounds counter-intuitive. However, we show how a software L1 cache can perform significantly better than the hardware L1 cache for data-intensive streaming (i.e., "Big-Data") GPGPU applications. Hardware L1 data caches can perform poorly on current GPUs because the size of the L1 is far too small and its cache line size is too large given the number of threads that typically need to run in parallel. Our paper makes two contributions. First, we experimentally characterize the performance behavior of modern GPU memory hierarchies and in doing so identify a number of bottlenecks. Second, we describe the design and implementation of a software L1 cache, S-L1. On ten streaming GPGPU applications, S-L1 performs 1.9 times faster, on average, when compared to using the default hardware L1, and 2.1 times faster, on average, when compared to using no L1 cache.
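The core mechanism of a software cache is a tag check performed in ordinary load code. The sketch below is a CPU-side illustration of that lookup logic only; S-L1 itself lives on the GPU, and its line size, capacity, and placement details are the paper's, not these assumed values (16-byte lines, 64 lines, direct-mapped).

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define LINE_BYTES 16      /* small lines: less waste with many threads */
#define NLINES     64      /* 1 KB direct-mapped software cache */

static uint8_t  line_data[NLINES][LINE_BYTES];
static uint64_t line_tag[NLINES];
static int      line_valid[NLINES];
static long hits, misses;

static uint8_t cached_load(const uint8_t *mem, uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    int set = (int)(line % NLINES);          /* direct-mapped index */
    if (!line_valid[set] || line_tag[set] != line) {
        /* Miss: fetch the whole line from the backing store */
        memcpy(line_data[set], mem + line * LINE_BYTES, LINE_BYTES);
        line_tag[set] = line;
        line_valid[set] = 1;
        misses++;
    } else {
        hits++;
    }
    return line_data[set][addr % LINE_BYTES];
}

int main(void) {
    static uint8_t mem[1 << 16];             /* stand-in "global" memory */
    for (int i = 0; i < (1 << 16); i++) mem[i] = (uint8_t)i;
    long sum = 0;
    /* Two passes over 1 KB: the second fits entirely and should hit */
    for (int pass = 0; pass < 2; pass++)
        for (uint64_t a = 0; a < 1024; a++)
            sum += cached_load(mem, a);
    printf("hits=%ld misses=%ld sum=%ld\n", hits, misses, sum);
    return 0;
}
```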
Citations: 1
Implications of Memory Interference for Composed HPC Applications
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818965
Brian Kocoloski, Yuyu Zhou, B. Childers, J. Lange
The cost of inter-node I/O and data movement is becoming increasingly prohibitive for large-scale High Performance Computing (HPC) applications. This trend is leading to the emergence of composed in situ applications that co-locate multiple components on the same node. However, these components may contend for underlying memory system resources. In this extended research abstract, we present a preliminary evaluation of the impacts of contention for shared resources in the memory hierarchy, including the last-level cache (LLC) and DRAM bandwidth. We show that even modest levels of memory contention can have substantial performance implications for some benchmarks, and argue for a cross-layer approach to resource partitioning and scheduling on future HPC systems.
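A crude way to observe the contention the abstract measures is to time a streaming reader alone and then with a co-runner competing for LLC capacity and DRAM bandwidth. The probe below is a toy under stated assumptions (buffer size, thread count, a busy-wait start flag), not the authors' methodology; compile with -pthread.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define BUF_LONGS (1L << 24)        /* ~128 MB per thread: overflows any LLC */
#define PASSES    4

static volatile int go;

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static void *stream(void *arg) {
    long *buf = arg, sum = 0;
    while (!go) ;                    /* crude start barrier */
    for (int r = 0; r < PASSES; r++)
        for (long i = 0; i < BUF_LONGS; i++)
            sum += buf[i];
    return (void *)sum;              /* keep the loop from being elided */
}

static double timed_run(int nthreads) {
    pthread_t th[2];
    long *bufs[2];
    go = 0;
    for (int t = 0; t < nthreads; t++) {
        bufs[t] = malloc(BUF_LONGS * sizeof(long));
        for (long i = 0; i < BUF_LONGS; i++) bufs[t][i] = i;  /* touch pages */
        pthread_create(&th[t], NULL, stream, bufs[t]);
    }
    double t0 = now_sec();
    go = 1;
    for (int t = 0; t < nthreads; t++) pthread_join(th[t], NULL);
    double dt = now_sec() - t0;
    for (int t = 0; t < nthreads; t++) free(bufs[t]);
    return dt;
}

int main(void) {
    double alone = timed_run(1);
    double contended = timed_run(2);
    printf("alone: %.2fs   contended: %.2fs   slowdown: %.2fx\n",
           alone, contended, contended / alone);
    return 0;
}
```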
Citations: 6
Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818982
Lifeng Nai, Hyesoon Kim
Processing in Memory (PIM) was first proposed decades ago to reduce the overhead of data movement between core and memory. With the advances in 3D-stacking technologies, PIM architectures have recently regained researchers' attention. Several fully-programmable PIM architectures as well as programming models were proposed in previous literature. Meanwhile, the memory industry has also started to integrate computation units into the Hybrid Memory Cube (HMC). The HMC 2.0 specification supports a number of atomic instructions. Although the instruction support is limited, it enables us to offload computations at instruction granularity. In this paper, we present a preliminary study of instruction offloading on HMC 2.0 using graph traversals as an example. By demonstrating the programmability and performance benefits, we show the feasibility of an instruction-level offloading PIM architecture.
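The offloading opportunity is easiest to see in a BFS step, where the per-neighbor update is a single read-modify-write on the level array. In the sketch below, a host compare-and-swap stands in for the in-memory atomic an HMC 2.0-class device could execute near the data; the tiny CSR graph and the stand-in atomic are assumptions, not the HMC interface.

```c
#include <stdio.h>
#include <stdatomic.h>

#define NV 6
/* Small illustrative CSR graph: vertex u's edges are
 * col_idx[row_ptr[u] .. row_ptr[u+1]-1]. */
static const int row_ptr[NV + 1] = {0, 2, 4, 5, 6, 7, 7};
static const int col_idx[7]      = {1, 2, 3, 4, 5, 5, 5};
static atomic_int level[NV];

int main(void) {
    for (int v = 0; v < NV; v++) atomic_init(&level[v], -1);
    atomic_store(&level[0], 0);
    int frontier[NV] = {0}, nf = 1, depth = 0;
    while (nf) {
        int next[NV], nn = 0;
        for (int f = 0; f < nf; f++) {
            int u = frontier[f];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int v = col_idx[e], expect = -1;
                /* This compare-and-swap is the read-modify-write a
                 * PIM-capable memory could execute near the data
                 * instead of shipping the line to the core. */
                if (atomic_compare_exchange_strong(&level[v], &expect,
                                                   depth + 1))
                    next[nn++] = v;
            }
        }
        for (int i = 0; i < nn; i++) frontier[i] = next[i];
        nf = nn;
        depth++;
    }
    for (int v = 0; v < NV; v++)
        printf("vertex %d: level %d\n", v, atomic_load(&level[v]));
    return 0;
}
```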
Citations: 32
Near memory data structure rearrangement
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818986
M. Gokhale, G. S. Lloyd, C. Hajas
As CPU core counts continue to increase, the gap between compute power and available memory bandwidth has widened. A larger and deeper cache hierarchy benefits locality-friendly computation, but offers limited improvement to irregular, data intensive applications. In this work we explore a novel approach to accelerating these applications through in-memory data restructuring. Unlike other proposed processing-in-memory architectures, the rearrangement hardware performs data reduction, not compute offload. Using a custom FPGA emulator, we quantitatively evaluate performance and energy benefits of near-memory hardware structures that dynamically restructure in-memory data to cache-friendly layout, minimizing wasted memory bandwidth. Our results on representative irregular benchmarks using the Micron Hybrid Memory Cube memory model show speedup, bandwidth savings, and energy reduction. We present an API for the near-memory accelerator and describe the interaction between the CPU and the rearrangement hardware with application examples. The merits of an SRAM vs. a DRAM scratchpad buffer for rearranged data are explored.
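One way to picture the data reduction the abstract describes: when an irregular access pattern touches one hot field per 64-byte record, packing those fields into a dense buffer before the compute loop means only the useful bytes travel to the core. The sketch below shows the layout transformation on the host; in the paper this gather would run in near-memory hardware, and the record shape and sizes here are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

struct record {
    double key;            /* the one hot field */
    char   payload[56];    /* 64 B total: one cache line per record */
};

int main(void) {
    enum { N = 1 << 20, M = 4096 };
    struct record *recs = calloc(N, sizeof *recs);
    int *idx = malloc(M * sizeof *idx);
    for (int i = 0; i < N; i++) recs[i].key = i * 0.5;
    for (int j = 0; j < M; j++) idx[j] = rand() % N;   /* irregular accesses */

    /* Rearrangement step: done near memory, only 8 useful bytes per
     * record cross to the core instead of a full 64 B line. */
    double *dense = malloc(M * sizeof *dense);
    for (int j = 0; j < M; j++) dense[j] = recs[idx[j]].key;

    double sum = 0;                                    /* cache-friendly pass */
    for (int j = 0; j < M; j++) sum += dense[j];
    printf("sum = %.1f\n", sum);

    free(recs); free(idx); free(dense);
    return 0;
}
```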
Citations: 40
Dynamic Memory Pressure Aware Ballooning
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818967
Jinchun Kim, Viacheslav V. Fedorov, Paul V. Gratz, A. Reddy
Hardware virtualization is a major component of large-scale server and data center deployments due to its facilitation of server consolidation and scalability. Virtualization, however, comes at a high cost in terms of system main memory utilization. Current virtual machine (VM) memory management solutions impose a high performance penalty and are oblivious to the operating regime of the system. Therefore, there is a great need for low-impact VM memory management techniques which are aware of, and reactive to, current system state, to drive down the overheads of virtualization. We observe that the host machine operates under different memory pressure regimes, as the memory demand from guest VMs changes dynamically at runtime. Adapting to this runtime system state is critical to reduce the performance cost of VM memory management. In this paper, we propose a novel dynamic memory management policy called Memory Pressure Aware (MPA) ballooning. MPA ballooning dynamically allocates memory resources to each VM based on the current memory pressure regime. Moreover, MPA ballooning proactively reacts and adapts to sudden changes in memory demand from guest VMs. MPA ballooning requires no additional hardware support, nor does it incur extra minor page faults in its memory pressure estimation. We show that MPA ballooning provides a 13.2% geomean speed-up versus current ballooning techniques across a set of application mixes running in guest VMs, often yielding performance nearly identical to that of a non-memory-constrained system.
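The regime-driven idea can be sketched as a small policy function: classify host memory pressure, then inflate or deflate a guest's balloon accordingly. The thresholds and step sizes below are invented for illustration; the paper's pressure estimator and reclamation mechanics are considerably more involved.

```c
#include <stdio.h>

enum regime { LOW, MODERATE, HEAVY };

/* Illustrative regime thresholds on the host's free-memory fraction. */
static enum regime classify(double free_frac) {
    if (free_frac > 0.50) return LOW;
    if (free_frac > 0.15) return MODERATE;
    return HEAVY;
}

/* Returns the new balloon size (MB) for one VM given host free memory:
 * deflate under low pressure (give memory back to the guest), hold
 * steady under moderate pressure, inflate under heavy pressure. */
static int adjust_balloon(int balloon_mb, double host_free_frac) {
    switch (classify(host_free_frac)) {
    case LOW:      return balloon_mb > 256 ? balloon_mb - 256 : 0;
    case MODERATE: return balloon_mb;
    case HEAVY:    return balloon_mb + 512;
    }
    return balloon_mb;
}

int main(void) {
    double samples[] = {0.70, 0.40, 0.10, 0.05, 0.30, 0.60};
    int balloon = 1024;
    for (int i = 0; i < 6; i++) {
        balloon = adjust_balloon(balloon, samples[i]);
        printf("free=%2.0f%% -> balloon=%d MB\n", samples[i] * 100, balloon);
    }
    return 0;
}
```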
Citations: 17
Software Techniques for Scratchpad Memory Management
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818966
Paul Sebexen, Thomas Sohmers
Scratchpad memory is commonly encountered in embedded systems as an alternative or supplement to caches [3]; however, cache-containing architectures continue to be preferred in many applications because of their general ease of programmability. We envision a language-agnostic software management system that improves portability to scratchpad architectures and significantly lowers the power consumption of ported applications. We review a selection of existing techniques, discuss their applicability to various memory systems, and identify opportunities for applying new methods and optimizations to improve memory management on relevant architectures.
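The pattern that software scratchpad management substitutes for a hardware cache is explicit tiling: stage a tile into the scratchpad, compute on it, write it back. The sketch below illustrates that pattern with plain memcpy standing in for DMA transfers; the 4 KB scratchpad size is an assumption.

```c
#include <stdio.h>
#include <string.h>

#define SPM_BYTES 4096                    /* assumed on-chip scratchpad size */
#define TILE      (SPM_BYTES / (int)sizeof(float))

static float spm[TILE];                   /* stand-in for the scratchpad RAM */

/* Scale every element of a "DRAM" array, one scratchpad tile at a time. */
static void scale_array(float *dram, int n, float k) {
    for (int base = 0; base < n; base += TILE) {
        int len = (n - base < TILE) ? n - base : TILE;
        memcpy(spm, dram + base, len * sizeof(float));  /* "DMA in" */
        for (int i = 0; i < len; i++)
            spm[i] *= k;                                /* compute in SPM */
        memcpy(dram + base, spm, len * sizeof(float));  /* "DMA out" */
    }
}

int main(void) {
    static float a[3000];
    for (int i = 0; i < 3000; i++) a[i] = (float)i;
    scale_array(a, 3000, 2.0f);
    printf("a[0]=%.0f a[2999]=%.0f\n", a[0], a[2999]);
    return 0;
}
```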
Citations: 3
Architecture Exploration for Data Intensive Applications
Pub Date: 2015-10-05 DOI: 10.1145/2818950.2818970
Fernando Martin del Campo, P. Chow
This paper presents Compass, a hardware/software simulator for data-intensive applications. Currently focused on in-memory stores, the simulator's objective is to explore diverse algorithms and hardware architectures, serving as an aid in designing systems for applications whose behaviour is dictated by high rates of data transfer. Instead of simulating the devices of a conventional computing system, the modules in Compass represent the stages of the procedure that serves a request to store, retrieve, or delete information in a particular memory architecture, giving the simulator the flexibility to test and analyze several different algorithms, components, and ideas. The system maintains a cycle-accurate model that makes it easy to interface with simulators of physical devices such as RAMs. Under such a scheme, the simulator of a physical memory in the system anchors the timing to a realistic scenario, while the rest of the components can be easily modified to explore alternative approaches.
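A request-centric, stage-based simulator can be miniaturized to a few dozen lines: each request carries its current stage and remaining latency, and every simulated cycle it either burns a cycle or advances. The toy below uses invented stage names and latencies and models no structural hazards; Compass's modules are far richer.

```c
#include <stdio.h>

enum stage { DECODE, LOOKUP, MEMACC, DONE };
static const int stage_lat[3] = {1, 2, 5};    /* cycles spent in each stage */

struct req { int id, stage, left; };

int main(void) {
    /* Three requests, arriving one per cycle, all starting in DECODE. */
    struct req reqs[3] = {{0, DECODE, 1}, {1, DECODE, 1}, {2, DECODE, 1}};
    int pending = 3;
    for (int cycle = 0; pending; cycle++) {
        for (int i = 0; i < 3; i++) {
            struct req *r = &reqs[i];
            if (r->stage == DONE || cycle < r->id) continue; /* staggered arrival */
            if (--r->left == 0) {                /* stage latency elapsed */
                r->stage++;
                if (r->stage == DONE) {
                    printf("req %d finished at cycle %d\n", r->id, cycle);
                    pending--;
                } else {
                    r->left = stage_lat[r->stage];
                }
            }
        }
    }
    return 0;
}
```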
Citations: 1