PEPERONI: Pre-Estimating the Performance of Near-Memory Integration
Oliver Lenke, Richard Petri, Thomas Wild, A. Herkersdorf
DOI: 10.1145/3488423.3519329
Near-memory integration strives to tackle the challenges of low data locality and the power consumption of cross-chip data transfers, now commonly referred to as the locality wall. To keep the costly engineering effort bounded when transforming an existing non-near-memory architecture into a near-memory instance, reliable performance estimation is needed during early design stages. We propose PEPERONI, an agile performance estimation model that predicts the runtime of representative benchmarks under near-memory acceleration on an MPSoC prototype. By relying solely on measurements of an existing baseline architecture, the method provides reliable estimates of the performance of near-memory processing units before their expensive implementation. The model is based on a quantitative description of memory boundedness and requires no algorithmic knowledge, which makes it applicable to a wide range of applications.
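As a rough illustration of how a memory-boundedness-based estimate of this kind can be derived from baseline measurements alone, the sketch below splits the measured runtime into compute and memory time and scales only the memory-bound share; the function name, the split, and the speedup parameter are illustrative assumptions, not PEPERONI's actual model.

```python
def estimate_near_memory_runtime(baseline_runtime_s: float,
                                 memory_bound_fraction: float,
                                 near_memory_speedup: float) -> float:
    """Estimate runtime after near-memory acceleration (illustrative model).

    baseline_runtime_s    -- measured runtime on the existing baseline architecture
    memory_bound_fraction -- fraction of baseline time stalled on cross-chip traffic
                             (derived from performance counters)
    near_memory_speedup   -- assumed speedup of the memory-bound portion near memory
    """
    memory_time = baseline_runtime_s * memory_bound_fraction
    compute_time = baseline_runtime_s - memory_time
    return compute_time + memory_time / near_memory_speedup


# Example: a benchmark that is 70% memory bound and 4x faster near memory.
print(estimate_near_memory_runtime(10.0, 0.7, 4.0))   # -> 4.75 seconds
```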
{"title":"PEPERONI: Pre-Estimating the Performance of Near-Memory Integration","authors":"Oliver Lenke, Richard Petri, Thomas Wild, A. Herkersdorf","doi":"10.1145/3488423.3519329","DOIUrl":"https://doi.org/10.1145/3488423.3519329","url":null,"abstract":"Near-memory integration strives to tackle the challenge of low data locality and power consumption originating from cross- chip data transfers, meanwhile referred to as locality wall. In order to keep costly engineering efforts bounded when transforming an existing non-near-memory architecture into a near-memory instance, reliable performance estimation during early design stages is needed. We propose PEPERONI, an agile performance estimation model to predict the runtime of representative benchmarks under near-memory acceleration on an MPSoC prototype. By relying solely on measurements of an existing baseline architecture, the method provides reliable estimations on the performance of near-memory processing units before their expensive implementation. The model is based on a quantitative description of memory boundedness and is independent of algorithmic knowledge, what facilitates its applicability to various applications.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126055010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Rank Graph-based Application Objects on Heterogeneous Memories
Diego Moura, V. Petrucci, D. Mossé
DOI: 10.1145/3488423.3519324
Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit than DRAM. Its main drawback is that it is typically slower than DRAM, while DRAM itself faces scalability problems due to its cost and energy consumption. PMEM will therefore likely coexist with DRAM in computer systems, and the biggest challenge is deciding which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing the application objects that have the most influence on application performance on Intel Optane DC Persistent Memory. In the first part of our work, we build a tool that automates the profiling and analysis of application objects. In the second part, we build a machine learning model to predict the most critical objects within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit as using a carefully chosen set of features. By performing data placement with our predictive model, we reduce execution time degradation by 12% on average and by up to 30% compared to a baseline approach based on the LLC-miss indicator.
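The sketch below illustrates the kind of placement decision described: rank application objects with a trained model and greedily pin the highest-ranked ones in DRAM, leaving the rest in PMEM. The feature set, the gradient-boosting model, and all numbers are hypothetical placeholders, not the paper's pipeline.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-object profiling features: LLC misses, access count,
# store ratio, object size (MB). Labels: measured slowdown when the object
# lives in PMEM instead of DRAM.
train_X = [[1.2e6, 5.0e6, 0.30, 64.0],
           [3.0e4, 8.0e5, 0.05, 512.0],
           [9.5e5, 2.0e6, 0.60, 128.0]]
train_y = [0.42, 0.03, 0.35]

model = GradientBoostingRegressor().fit(train_X, train_y)

def place_objects(objects, dram_budget_mb):
    """Greedily place the most performance-critical objects in DRAM."""
    ranked = sorted(objects, key=lambda o: model.predict([o["features"]])[0],
                    reverse=True)
    placement, used = {}, 0.0
    for obj in ranked:
        if used + obj["size_mb"] <= dram_budget_mb:
            placement[obj["name"]] = "DRAM"
            used += obj["size_mb"]
        else:
            placement[obj["name"]] = "PMEM"
    return placement


objs = [{"name": "graph_edges", "features": [2.0e6, 6.0e6, 0.4, 256.0], "size_mb": 256.0},
        {"name": "vertex_ids",  "features": [4.0e4, 9.0e5, 0.1, 64.0],  "size_mb": 64.0}]
print(place_objects(objs, dram_budget_mb=300.0))
```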
{"title":"Learning to Rank Graph-based Application Objects on Heterogeneous Memories","authors":"Diego Moura, V. Petrucci, D. Mossé","doi":"10.1145/3488423.3519324","DOIUrl":"https://doi.org/10.1145/3488423.3519324","url":null,"abstract":"Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit when compared with DRAM. Its main drawback is that it is typically slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. Soon, PMEM will likely coexist with DRAM in computer systems but the biggest challenge is to know which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application’s performance using Intel Optane DC Persistent Memory. In the first part of our work, we built a tool that automates the profiling and analysis of application objects. In the second part, we build a machine learning model to predict the most critical object within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit compared to using a carefully chosen set of features. By performing data placement using our predictive model, we can reduce the execution time degradation by 12% (average) and 30% (max) when compared to the baseline’s approach based on LLC misses indicator.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131892438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DuoMC: Tight DRAM Latency Bounds with Shared Banks and Near-COTS Performance
Reza Mirosanlou, Mohamed Hassan, R. Pellizzoni
DOI: 10.1145/3488423.3519322
DRAM memory controllers (MCs) in COTS systems are designed primarily for average performance and offer no worst-case guarantees, while real-time MCs provide timing guarantees at the cost of a significant average performance degradation. For this reason, hardware vendors have been reluctant to integrate real-time solutions into high-performance platforms. In this paper, we overcome this performance-predictability trade-off by introducing DuoMC, a novel memory controller that augments a COTS MC with a real-time scheduler and run-time monitoring to provide predictability guarantees. Leveraging the fact that the memory resource is rarely overloaded, DuoMC lets the system enjoy the high performance of the conventional MC most of the time and switches to the real-time scheduler only when timing guarantees risk being violated, which rarely occurs. In addition, unlike most existing real-time MCs, DuoMC enables the use of both private and shared DRAM banks among cores, which facilitates communication among tasks. We evaluate DuoMC using a cycle-accurate multi-core simulator. Results show that DuoMC provides better or comparable latency guarantees than state-of-the-art real-time MCs with limited performance loss (only 8% in the worst case) compared to the COTS MC.
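A toy sketch of the switching policy described above: the controller stays on the high-performance COTS scheduler and falls back to the real-time scheduler only when the monitored latency slack of a pending request becomes too small. The slack metric, bound, and guard band are illustrative assumptions, not DuoMC's actual hardware logic.

```python
class DuoMCMonitor:
    """Toy run-time monitor that decides which scheduler arbitrates DRAM."""

    def __init__(self, worst_case_bound_cycles: int, guard_band_cycles: int):
        self.bound = worst_case_bound_cycles   # per-request latency bound
        self.guard = guard_band_cycles         # safety margin before switching
        self.mode = "COTS"

    def update(self, pending_request_ages):
        """Pick the scheduling mode given the age (in cycles) of pending requests."""
        slack = min((self.bound - age for age in pending_request_ages),
                    default=self.bound)
        # Switch to the real-time scheduler only when a request risks missing
        # its bound; otherwise keep the high-performance path.
        self.mode = "REAL_TIME" if slack <= self.guard else "COTS"
        return self.mode


mon = DuoMCMonitor(worst_case_bound_cycles=500, guard_band_cycles=60)
print(mon.update([120, 310]))   # COTS: plenty of slack
print(mon.update([120, 460]))   # REAL_TIME: a request is close to its bound
```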
{"title":"DuoMC: Tight DRAM Latency Bounds with Shared Banks and Near-COTS Performance","authors":"Reza Mirosanlou, Mohamed Hassan, R. Pellizzoni","doi":"10.1145/3488423.3519322","DOIUrl":"https://doi.org/10.1145/3488423.3519322","url":null,"abstract":"DRAM memory controllers (MCs) in COTS systems are designed primarily for average performance, offering no worst-case guarantees, while real-time MCs provide timing guarantees at the cost of a significant average performance degradation. For this reason, hardware vendors have been reluctant to integrate real-time solutions in high-performance platforms. In this paper, we overcome this performance-predictability trade-off by introducing DuoMC, a novel memory controller that promotes to augment COTS MCs with a real-time scheduler and run-time monitoring to provide predictability guarantees. Leveraging the fact that the resource is barely overloaded, DuoMC allows the system to enjoy the high-performance of the conventional MC most of the time, while switching to the real-time scheduler only when timing guarantees risk being violated, which rarely occurs. In addition, unlike most existing real-time MCs, DuoMC enables the utilization of both private and shared DRAM banks among cores to facilitate communication among tasks. We evaluate DuoMC using a cycle-accurate multi-core simulator. Results show that DuoMC can provide better or comparable latency guarantees than state-of-the-art real-time MCs with limited performance loss (only 8% in the worst scenario) compared to the COTS MC.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114946489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zoned FTL: Achieve Resource Isolation via Hardware Virtualization
Luyi Kang, B. Jacob
DOI: 10.1145/3488423.3519326
NVMe Solid-State Drives (SSDs) offer unprecedented throughput and response time for data centers. To increase resource utilization and enable the necessary isolation, service providers usually accommodate multiple Virtual Machines (VMs) and lightweight containers on the same physical server. To date, providing predictable storage performance remains challenging because commercial datacenter NVMe SSDs still appear as black-box block devices. This motivates us to re-examine the I/O stack and firmware design and to discuss and quantify the root causes of performance interference. We argue that the semantic gap between predictable performance and the underlying device must be bridged to address this challenge. We propose Zoned FTL, a split-level design that enables strong physical isolation for multiple virtualized services with minimal changes to existing storage stacks. We implement a prototype on an SSD emulator and evaluate it in a variety of multi-tenant environments. The evaluation results demonstrate that Zoned FTL barely impacts raw performance while delivering up to 1.51x higher throughput and reducing the 99th-percentile latency by up to 79.4% in a multi-tenant environment.
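A minimal sketch of the isolation idea, assuming a channel-granularity mapping in which each tenant is bound to its own set of flash channels so that tenants cannot interfere with one another; this illustrates the general approach, not the paper's FTL design.

```python
class ZonedMapping:
    """Toy tenant-to-zone mapping giving each tenant private flash resources."""

    def __init__(self, num_channels: int):
        self.free_channels = list(range(num_channels))
        self.tenant_channels = {}          # tenant id -> list of channel ids

    def attach(self, tenant: str, channels_needed: int):
        """Reserve dedicated channels for a tenant (VM or container)."""
        if channels_needed > len(self.free_channels):
            raise RuntimeError("not enough isolated channels available")
        granted = [self.free_channels.pop() for _ in range(channels_needed)]
        self.tenant_channels[tenant] = granted
        return granted

    def route(self, tenant: str, logical_block: int) -> int:
        """Route a tenant's logical block to one of its private channels."""
        own = self.tenant_channels[tenant]
        return own[logical_block % len(own)]


ftl = ZonedMapping(num_channels=8)
ftl.attach("vm-a", 2)
ftl.attach("vm-b", 4)
print(ftl.route("vm-a", 4097))   # always lands on one of vm-a's two channels
```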
{"title":"Zoned FTL: Achieve Resource Isolation via Hardware Virtualization","authors":"Luyi Kang, B. Jacob","doi":"10.1145/3488423.3519326","DOIUrl":"https://doi.org/10.1145/3488423.3519326","url":null,"abstract":"NVMe Solid-State Drives (SSDs) offer unprecedented throughput and response time for data centers. To increase resource utilization and enable necessary isolation, service providers usually accommodate multiple Virtual Machines (VMs) and lightweight containers on the same physical server. Till today, providing predictable storage performance is still challenging as commercial datacenter NVMe SSDs still appear as black-box block devices. This motivates us to re-examine the I/O stack and firmware design, discuss and quantify the root causes of performance interference. We argue that the semantic gap between predictable performance and the underlying device must be bridged to address this challenge. We propose a split-level design, Zoned FTL (), which enables strong physical isolation for multiple virtualized services with minimal changes in existing storage stacks. We implement the prototype on an SSD emulator and evaluate it under a variety of multi-tenant environments. The evaluation results demonstrate that barely impacts the raw performance while delivering up to 1.51x better throughput and reduce the 99th percentile latency by up to 79.4% in a multi-tenancy environment.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122077069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAM Refresh with Master Wordline Granularity Control of Refresh Intervals: Position Paper
T. Vogelsang, Brent Haukness, E. Linstadt, Torsten Partsch, James Tringali
DOI: 10.1145/3488423.3519321
DRAM cells need to be refreshed periodically to preserve the charge stored in them. Multiple mechanisms cause this charge loss, and they vary in strength. The retention time is therefore not the same for all DRAM cells but follows a distribution spanning multiple orders of magnitude between the cells with the highest and the lowest charge loss. Today's DRAM standards have a single refresh interval that is set based on the retention time of the weakest cells that are not replaced by redundancy or repaired with ECC. Refresh adds overhead to DRAM energy consumption and command bandwidth and blocks access to banks currently being refreshed, and this overhead grows with larger DRAM capacity and shrinking DRAM feature size. In this paper we propose a method and a corresponding circuit implementation that allow different wordlines to be refreshed at different intervals, based on the minimum required retention time of the cells on each wordline. We show that this method can increase the effective refresh interval of a modern DRAM by between 40% and 80% without loss of reliability, with a corresponding reduction in the contribution of refresh to energy consumption and command bandwidth. Our evaluation shows that the method can be implemented with a moderate DRAM die size impact (between 1% and 2.5%). The method does not require the memory controller to keep track of refresh addresses: after initialization of the DRAM devices, the memory controller only needs to issue refresh commands as it does today, albeit fewer than without our approach.
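A small worked example of the benefit: if master wordlines are binned by required retention time and each bin is refreshed at a multiple of the base interval, the total number of refresh operations drops. The bin populations below are made-up numbers, chosen so the result happens to land in the 40% to 80% range the paper reports; they are not measured retention distributions.

```python
# Fraction of master wordlines in each retention bin and the refresh-interval
# multiplier that bin can tolerate (1 = today's base refresh interval).
bins = [
    (0.40, 1),   # weakest wordlines: refresh at the base rate
    (0.35, 2),   # these tolerate twice the base interval
    (0.25, 4),   # these tolerate four times the base interval
]

# Refresh operations per base interval, normalized to refreshing everything at 1x.
relative_refresh_ops = sum(frac / mult for frac, mult in bins)
effective_interval_gain = 1.0 / relative_refresh_ops - 1.0

print(f"refresh operations: {relative_refresh_ops:.2f}x of baseline")        # 0.64x
print(f"effective refresh interval increase: {effective_interval_gain:.0%}") # ~57%
```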
{"title":"DRAM Refresh with Master Wordline Granularity Control of Refresh Intervals: Position Paper","authors":"T. Vogelsang, Brent Haukness, E. Linstadt, Torsten Partsch, James Tringali","doi":"10.1145/3488423.3519321","DOIUrl":"https://doi.org/10.1145/3488423.3519321","url":null,"abstract":"DRAM cells need to be periodically refreshed to preserve the charge stored in them. There are multiple mechanisms causing the loss of charge. These mechanisms also vary in strength. The retention time is therefore not the same for all DRAM cells but follows a distribution with multiple orders of magnitude difference between the retention time of cells with the highest charge loss and the cells with the lowest charge loss. Today's DRAM standards have one refresh interval that is set based on the retention time of the weakest cells that are not replaced by redundancy or repaired with ECC. Refresh adds overhead to DRAM energy consumption and command bandwidth and blocks access to banks currently being refreshed. This overhead increases with larger DRAM capacity and shrinking DRAM feature size. In this paper we propose a method and corresponding circuit implementation that allows using different refresh intervals based on the required minimum retention time of cells on those wordlines. We show that this method can increase the effective refresh interval of a modern DRAM between 40% and 80% without loss of reliability and a corresponding reduction of the contribution of refresh to energy consumption and command bandwidth. Our evaluation shows that the method can be implemented with a moderate DRAM die size impact (between 1% and 2.5%). The method does not require the memory controller to keep track of refresh addresses. After initialization of the DRAM devices, the memory controller needs only to issue refresh commands as today, albeit a smaller number than without our approach.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130729934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Computer Vision-based Machine Intelligent Hybrid Memory Management
Thaleia Dimitra Doudali, Ada Gavrilovska
DOI: 10.1145/3488423.3519325
Current state-of-the-art systems for hybrid memory management are enriched with machine intelligence. To enable the practical use of Machine Learning (ML), system-level page schedulers focus ML model training on a small subset of the applications' memory footprint, while using existing lightweight historical information to predict the access behavior of the majority of the pages. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods, which involve the calculation of detailed performance estimates that depend on the configuration of the hybrid memory platform. This paper explores opportunities to reduce such operational overheads of machine learning-based hybrid memory page schedulers by using visualization techniques to depict memory access patterns and reveal spatial and temporal correlations among the selected pages that current methods fail to leverage. We propose an initial version of a visualization pipeline for prioritizing pages for machine learning that is independent of the hybrid memory configuration. Our approach selects pages whose ML-based management delivers, on average, performance within 5% of current solutions, while reducing page selection time by 75x. We discuss future directions and make the case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.
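A minimal sketch of the visualization step described, assuming a simple binning of a (timestamp, page) access trace into an image-like pages-by-epochs heatmap in which spatially and temporally correlated pages appear as visible structure; the binning and log scaling are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def accesses_to_image(trace, num_pages, num_epochs, epoch_len):
    """Build a (pages x epochs) heatmap from a (timestamp, page) access trace."""
    img = np.zeros((num_pages, num_epochs), dtype=np.float32)
    for t, page in trace:
        epoch = min(int(t // epoch_len), num_epochs - 1)
        img[page, epoch] += 1.0
    # Log-scale so hot and cold pages are both visible in the same image.
    return np.log1p(img)


# Tiny example: a strided pattern over 8 pages, split into 4 time epochs.
trace = [(t, (3 * t) % 8) for t in range(400)]
image = accesses_to_image(trace, num_pages=8, num_epochs=4, epoch_len=100)
print(image.shape)    # (8, 4): one row per page, one column per time epoch
```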
{"title":"Toward Computer Vision-based Machine Intelligent Hybrid Memory Management","authors":"Thaleia Dimitra Doudali, Ada Gavrilovska","doi":"10.1145/3488423.3519325","DOIUrl":"https://doi.org/10.1145/3488423.3519325","url":null,"abstract":"Current state-of-the-art systems for hybrid memory management are enriched with machine intelligence. To enable the practical use of Machine Learning (ML), system-level page schedulers focus the ML model training over a small subset of the applications’ memory footprint. At the same time, they use existing lightweight historical information to predict the access behavior of majority of the pages. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods. These methods involve the calculation of detailed performance estimates depending on the configuration of the hybrid memory platform. This paper explores the opportunities to reduce such operational overheads of machine learning-based hybrid memory page schedulers via use of visualization techniques to depict memory access patterns, and reveal spatial and temporal correlations among the selected pages, that current methods fail to leverage. We propose an initial version of a visualization pipeline for prioritizing pages for machine learning, that is independent of the hybrid memory configuration. Our approach selects pages whose ML-based management delivers, on average, performance levels within 5% of current solutions, while reducing by 75 × the page selection time. We discuss future directions and make a case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"50 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124528522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SoftRefresh: Targeted Refresh for Energy-efficient DRAM Systems via Software and Operating Systems Support
Duy-Thanh Nguyen, Nhut-Minh Ho, I. Chang
DOI: 10.1145/3488423.3519323
Due to their capacitive nature, DRAM cells must be refreshed regularly to retain their information. However, given the scale of DRAM deployment in modern computer systems, the energy overhead of DRAM refresh operations is becoming significant. The crux of managing DRAM refresh is knowing whether the data in particular cells are valid or not. Previous works have suggested many hardware schemes that effectively try to guess this. In this paper, we propose modifications that allow software involvement in regulating refresh operations, which opens the door for targeted, and hence minimal, refresh: only valid pages with potential bit errors are refreshed. Compared to conventionally refreshing the whole DRAM, our SoftRefresh saves up to 43% energy on average. Our proposal works on all types of modern DRAM with only minor modifications to existing hardware and software systems.
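A toy model of the software/hardware contract implied here, assuming the OS exposes a bitmap of DRAM rows that hold valid data and the refresh logic skips rows whose bit is clear; the row-bitmap interface is an assumption for illustration, not the paper's exact mechanism.

```python
class ValidRowBitmap:
    """Toy model of software-directed, targeted DRAM refresh."""

    def __init__(self, num_rows: int):
        self.valid = bytearray(num_rows)     # 1 = row holds live page data

    def mark(self, row: int, is_valid: bool):
        self.valid[row] = 1 if is_valid else 0

    def rows_to_refresh(self):
        """Hardware-side view: only valid rows receive refresh commands."""
        return [r for r, v in enumerate(self.valid) if v]


bmp = ValidRowBitmap(num_rows=16)
for r in (0, 1, 5, 9):                  # OS maps live pages onto these rows
    bmp.mark(r, True)
saved = 1 - len(bmp.rows_to_refresh()) / 16
print(f"refresh operations skipped: {saved:.0%}")   # 75% of rows skipped
```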
{"title":"SoftRefresh: Targeted refresh for Energy-efficient DRAM systems via Software and Operating Systems support","authors":"Duy-Thanh Nguyen, Nhut-Minh Ho, I. Chang","doi":"10.1145/3488423.3519323","DOIUrl":"https://doi.org/10.1145/3488423.3519323","url":null,"abstract":"Due to its capacitive nature, DRAM cells must be refreshed regularly to retain their information. However, due to the scale of DRAM deployment in modern computer systems, the energy overhead of DRAM refresh operations is becoming significant. The crux in managing DRAM refresh is knowing if the data in particular cells are valid or not. Previous works have suggested many hardware schemes that effectively try to guess this. In this paper, we propose modifications to allow software involvement in regulating refresh operations. This opens the door for targeted, and hence minimal, refresh operations. Only valid pages having potential bit errors will be refreshed. Compared to conventionally refreshing the whole DRAM, our SoftRefresh saves up to 43% energy on average. Our proposal can work on all types of modern DRAM with only minor modifications to the existing hardware and software systems.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127584741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAPCP: Memory Access Pattern Classifying Prefetcher
Manaal Mukhtar Jamadar, Jaspinder Kaur, Shirshendu Das
DOI: 10.1145/3488423.3519328
Prefetching is a technique used to improve system performance by bringing data or instructions into the cache before they are demanded by the core. Several prefetching techniques have been proposed, in both hardware and software, to predict the data to be prefetched with high accuracy and coverage. The memory access patterns of applications can be classified as either regular or irregular. Most prefetchers exclusively target one of these pattern types by learning from either the temporal or the spatial correlation among past data accesses. Our proposal focuses on covering all kinds of access patterns that can be predicted by either a temporal or a spatial prefetcher. Running both kinds of prefetchers in parallel is not a wise design, as it leads to unnecessary hardware (storage) overhead for the temporal prefetcher's metadata. We propose broadly classifying an application's memory access patterns on the fly as regular or irregular, and then using the appropriate prefetcher to issue prefetches for each class. This reduces the metadata requirement of the temporal prefetcher by 75%. Evaluation of our proposed solution on the SPEC CPU 2006 benchmarks achieves a speedup of 23.7% over the no-prefetching baseline, which is a 4% improvement over the state-of-the-art spatial prefetcher BIP and a 13.2% improvement over the temporal prefetcher Triage.
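A minimal sketch of the on-the-fly classification step, assuming a short address history is checked for a dominant stride and the stream is labeled regular (handed to a spatial prefetcher) or irregular (handed to a temporal prefetcher); the history length and threshold are illustrative, not MAPCP's actual heuristics.

```python
from collections import Counter, deque

class PatternClassifier:
    """Classify a memory access stream as 'regular' or 'irregular'."""

    def __init__(self, history: int = 16, regular_threshold: float = 0.6):
        self.addrs = deque(maxlen=history)
        self.threshold = regular_threshold

    def observe(self, block_addr: int) -> str:
        self.addrs.append(block_addr)
        if len(self.addrs) < 4:
            return "irregular"            # not enough history yet
        addrs = list(self.addrs)
        deltas = [b - a for a, b in zip(addrs, addrs[1:])]
        # If one stride dominates the recent deltas, the stream is regular.
        _, top_count = Counter(deltas).most_common(1)[0]
        return "regular" if top_count / len(deltas) >= self.threshold else "irregular"


clf = PatternClassifier()
for addr in range(0, 64, 4):              # perfectly strided stream
    label = clf.observe(addr)
print(label)                               # -> regular
```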
{"title":"MAPCP: Memory Access Pattern Classifying Prefetcher","authors":"Manaal Mukhtar Jamadar, Jaspinder Kaur, Shirshendu Das","doi":"10.1145/3488423.3519328","DOIUrl":"https://doi.org/10.1145/3488423.3519328","url":null,"abstract":"Prefetching is a technique used to improve system performance by bringing data or instructions in the cache before it is demanded by the core. Several prefetching techniques have been proposed, in both hardware and software, to predict the data to be prefetched with high accuracy and coverage. The memory patterns accessed by applications can be classified as either regular memory access patterns or irregular memory access patterns. Most prefetchers exclusively target either of these patterns by learning from either temporal or spatial correlation among the past data accesses observed. Our proposal focuses on covering all kinds of access patterns which can be predicted by a temporal as well as a spatial prefetcher. Running both kinds of prefetchers in parallel is not a wise design as it leads to unnecessary hardware (storage) overhead for metadata storage of temporal prefetcher. We propose broadly classifying the memory access patterns of applications on the go as regular or irregular, and then using an appropriate prefetcher to issue prefetches for the respective classes. This reduces the metadata requirement in case of temporal prefetcher by 75%. Evaluation of our proposed solution on SPEC CPU 2006 benchmarks achieve a speedup of 23.7% over the no-prefetching baseline, which is a 4% improvement over the state of the art spacial prefetcher BIP, and 13.2% improvement over the temporal prefetcher, Triage.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Latency Modulation Scheme for Solid State Drive
A. Berman
DOI: 10.1145/3488423.3519334
Due to density and performance considerations, the wordline of current Flash memories holds far more data than the legacy file-system sector size. A read operation obtains a page whose size equals the wordline length. However, typical workloads issue random reads with a standard sector size of 4KB. The question therefore arises whether the fact that the read sector is shorter than the page can be exploited to reduce the number of data sensing operations. In other words, can a data sector be extracted from a wordline using fewer sensing operations than are required to read the whole page? In this paper, we develop a data modulation scheme for low-latency random sector reads, referred to as Sector Packing. Our technique reduces latency in multi-bit-per-cell architectures and can also improve device throughput. For example, in QLC with 16KB pages, latency is reduced by 34%. Two implementation architectures are offered. The first increases channel data traffic and requires no changes to the NAND, only to the controller. The second adds a small hardware overhead inside the NAND, resulting in reduced data transmission over the SSD channel. Sector Packing is scalable, as the gain grows with the number of bits per cell.
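A back-of-the-envelope sketch of why packing a sector into fewer sensing operations reduces latency, assuming a toy read-latency model of a fixed overhead plus a per-sensing-step cost; the sensing counts and timings below are made-up numbers, not the paper's encoding or measurements.

```python
def read_latency_us(num_senses: int, sense_us: float = 10.0, overhead_us: float = 15.0):
    """Toy NAND read latency model: fixed overhead plus a cost per sensing step."""
    return overhead_us + num_senses * sense_us

# Illustrative assumption: the conventional 16KB page read needs 4 sensing
# steps to resolve the requested sector, while a packed 4KB sector needs 2.
baseline = read_latency_us(num_senses=4)
packed   = read_latency_us(num_senses=2)
print(f"latency reduction: {1 - packed / baseline:.0%}")   # ~36% with these numbers
```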
{"title":"Low-Latency Modulation Scheme for Solid State Drive","authors":"A. Berman","doi":"10.1145/3488423.3519334","DOIUrl":"https://doi.org/10.1145/3488423.3519334","url":null,"abstract":"Due to density and performance considerations, the current Flash memory wordline contains more cells than the legacy file-system sector size. Read operation can obtain a page size that is equal to wordline length. However, the typical workload has random read instructions with a standard sector size of 4KB. Therefore, the question arises whether the fact that the read sector is shorter than the page size can be used to reduce the number of data sensing operations. In other words, can a data sector be extracted from a wordline by using fewer sensing operations required to read the whole page? In this paper, we develop a data modulation scheme for low-latency random sector read, referred to as Sector Packing. Our technique reduces latency in multiple bits per cell architecture and can also improve device throughput. For example, in QLC with 16KB pages, latency is reduced by 34%. Two implementation architectures are offered. The first increases channel data traffic and do not require any NAND changes but only in the controller. The second architecture requires adding a small hardware overhead inside the NAND, resulting in reduced data transmission over the SSD channel. Sector Packing is scalable as the gain is higher with more bits per cell.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"132 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124003340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Writeback Modeling: Theory and Application to Zipfian Workloads
Wesley Smith, Daniel Byrne, C. Ding
DOI: 10.1145/3488423.3519331
As per-core CPU performance plateaus and data-bound applications like graph analytics and key-value stores become more prevalent, understanding memory performance is more important than ever. Many existing techniques for predicting and measuring cache performance on a given workload involve either static analysis or tracing, but programs like key-value stores can easily have billions of memory accesses in a trace, with access patterns driven by non-statically observable phenomena such as user behavior. Past analytical solutions focus on modeling cache hits, but the rise of non-volatile memory (NVM) such as Intel's Optane, with asymmetric read/write latencies, bandwidths, and power consumption, means that writes and writebacks are now critical performance considerations as well, especially in the context of large-scale software caches. We introduce two novel analytical cache writeback models that handle workloads with general frequency distributions; in addition, we provide closed-form instantiations for Zipfian workloads, one of the most ubiquitous frequency distribution types in data-bound applications. The models have different use cases and asymptotic runtimes, making them suited to different circumstances, but both are fully analytical: cache writeback statistics are computed with no tracing or sampling required. We demonstrate that these models are extremely accurate and fast. The first model, for an infinitely large level-two (L2) software cache, averages 5.0% relative error from ground truth and achieves a minimum speedup of 515x over a state-of-the-art trace analysis technique (AET) when generating writeback information for a single cache size. The second model, which is fully general with respect to L1 and L2 sizes but slower, averages 3.0% relative error from ground truth and achieves a minimum speedup of 105x over AET for a single cache size.
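As a rough illustration of what a trace-free writeback estimate for a Zipfian workload can look like, the sketch below uses Che's characteristic-time approximation for an LRU cache and assumes every access writes its object, so every eviction produces a writeback; this is a well-known simplification chosen for illustration, not either of the paper's models.

```python
import numpy as np
from scipy.optimize import brentq

def zipf_writeback_rate(num_objects: int, cache_size: int, alpha: float) -> float:
    """Writebacks per access for an LRU cache under a Zipf(alpha) workload,
    assuming every access writes its object (so every eviction is a writeback)."""
    ranks = np.arange(1, num_objects + 1)
    p = ranks ** -alpha
    p /= p.sum()                                   # Zipfian access probabilities

    # Characteristic time T: expected cache occupancy equals the cache size.
    occupancy = lambda T: np.sum(1.0 - np.exp(-p * T)) - cache_size
    T = brentq(occupancy, 1e-9, 1e12)

    hit_ratio = np.sum(p * (1.0 - np.exp(-p * T)))
    return 1.0 - hit_ratio                         # misses == evictions == writebacks


# Example: 100k objects, a cache holding 10% of them, Zipf exponent 0.9.
print(zipf_writeback_rate(num_objects=100_000, cache_size=10_000, alpha=0.9))
```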
{"title":"Writeback Modeling: Theory and Application to Zipfian Workloads","authors":"Wesley Smith, Daniel Byrne, C. Ding","doi":"10.1145/3488423.3519331","DOIUrl":"https://doi.org/10.1145/3488423.3519331","url":null,"abstract":"As per-core CPU performance plateaus and data-bound applications like graph analytics and key-value stores become more prevalent, understanding memory performance is more important than ever. Many existing techniques to predict and measure cache performance on a given workload involve either static analysis or tracing, but programs like key-value stores can easily have billions of memory accesses in a trace and have access patterns driven by non-statically observable phenomena such as user behavior. Past analytical solutions focus on modeling cache hits, but the rise of non-volatile memory (NVM) like Intel’s Optane with asymmetric read/write latencies, bandwidths, and power consumption means that writes and writebacks are now critical performance considerations as well, especially in the context of large-scale software caches. We introduce two novel analytical cache writeback models that function for workloads with general frequency distributions; in addition we provide closed-form instantiations for Zipfian workloads, one of the most ubiquitous frequency distribution types in data-bound applications. The models have different use cases and asymptotic runtimes, making them suited for use in different circumstances, but both are fully analytical; cache writeback statistics are computed with no tracing or sampling required. We demonstrate that these models are extremely accurate and fast: the first model, for infinitely large level-two (L2) software cache, averaged 5.0% relative error from ground truth and achieved a minimum speedup over a state-of-the-art trace analysis technique (AET) of 515x to generate writeback information for a single cache size. The second model, which is fully general with respect to L1 and L2 sizes but slower, averaged 3.0% relative error from ground truth and achieved a minimum speedup over AET of 105x for a single cache size.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128648636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}