PEPERONI: Pre-Estimating the Performance of Near-Memory Integration
Oliver Lenke, Richard Petri, Thomas Wild, A. Herkersdorf
DOI: 10.1145/3488423.3519329
Near-memory integration strives to tackle the challenges of low data locality and the power consumption of cross-chip data transfers, now commonly referred to as the locality wall. To keep the costly engineering effort bounded when transforming an existing non-near-memory architecture into a near-memory instance, reliable performance estimation is needed during early design stages. We propose PEPERONI, an agile performance estimation model that predicts the runtime of representative benchmarks under near-memory acceleration on an MPSoC prototype. By relying solely on measurements of an existing baseline architecture, the method provides reliable estimates of the performance of near-memory processing units before their expensive implementation. The model is based on a quantitative description of memory boundedness and requires no algorithmic knowledge, which makes it applicable to a wide range of applications.
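As a rough illustration of how a memory-boundedness-based estimate of this kind can be derived from baseline measurements alone, the sketch below splits the measured runtime into compute and memory time and scales only the memory-bound share; the function name, the split, and the speedup parameter are illustrative assumptions, not PEPERONI's actual model.

```python
def estimate_near_memory_runtime(baseline_runtime_s: float,
                                 memory_bound_fraction: float,
                                 near_memory_speedup: float) -> float:
    """Estimate runtime after near-memory acceleration (illustrative model).

    baseline_runtime_s    -- measured runtime on the existing baseline architecture
    memory_bound_fraction -- fraction of baseline time stalled on cross-chip traffic
                             (derived from performance counters)
    near_memory_speedup   -- assumed speedup of the memory-bound portion near memory
    """
    memory_time = baseline_runtime_s * memory_bound_fraction
    compute_time = baseline_runtime_s - memory_time
    return compute_time + memory_time / near_memory_speedup


# Example: a benchmark that is 70% memory bound and 4x faster near memory.
print(estimate_near_memory_runtime(10.0, 0.7, 4.0))   # -> 4.75 seconds
```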
{"title":"PEPERONI: Pre-Estimating the Performance of Near-Memory Integration","authors":"Oliver Lenke, Richard Petri, Thomas Wild, A. Herkersdorf","doi":"10.1145/3488423.3519329","DOIUrl":"https://doi.org/10.1145/3488423.3519329","url":null,"abstract":"Near-memory integration strives to tackle the challenge of low data locality and power consumption originating from cross- chip data transfers, meanwhile referred to as locality wall. In order to keep costly engineering efforts bounded when transforming an existing non-near-memory architecture into a near-memory instance, reliable performance estimation during early design stages is needed. We propose PEPERONI, an agile performance estimation model to predict the runtime of representative benchmarks under near-memory acceleration on an MPSoC prototype. By relying solely on measurements of an existing baseline architecture, the method provides reliable estimations on the performance of near-memory processing units before their expensive implementation. The model is based on a quantitative description of memory boundedness and is independent of algorithmic knowledge, what facilitates its applicability to various applications.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126055010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Rank Graph-based Application Objects on Heterogeneous Memories
Diego Moura, V. Petrucci, D. Mossé
DOI: 10.1145/3488423.3519324
Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit than DRAM. Its main drawback is that it is typically slower than DRAM, while DRAM itself faces scalability problems due to its cost and energy consumption. PMEM will therefore likely coexist with DRAM in computer systems, and the biggest challenge is deciding which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing the application objects that have the most influence on application performance on Intel Optane DC Persistent Memory. In the first part of our work, we build a tool that automates the profiling and analysis of application objects. In the second part, we build a machine learning model to predict the most critical objects within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit as using a carefully chosen set of features. By performing data placement with our predictive model, we reduce execution time degradation by 12% on average and by up to 30% compared to a baseline approach based on the LLC-miss indicator.
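The sketch below illustrates the kind of placement decision described: rank application objects with a trained model and greedily pin the highest-ranked ones in DRAM, leaving the rest in PMEM. The feature set, the gradient-boosting model, and all numbers are hypothetical placeholders, not the paper's pipeline.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-object profiling features: LLC misses, access count,
# store ratio, object size (MB). Labels: measured slowdown when the object
# lives in PMEM instead of DRAM.
train_X = [[1.2e6, 5.0e6, 0.30, 64.0],
           [3.0e4, 8.0e5, 0.05, 512.0],
           [9.5e5, 2.0e6, 0.60, 128.0]]
train_y = [0.42, 0.03, 0.35]

model = GradientBoostingRegressor().fit(train_X, train_y)

def place_objects(objects, dram_budget_mb):
    """Greedily place the most performance-critical objects in DRAM."""
    ranked = sorted(objects, key=lambda o: model.predict([o["features"]])[0],
                    reverse=True)
    placement, used = {}, 0.0
    for obj in ranked:
        if used + obj["size_mb"] <= dram_budget_mb:
            placement[obj["name"]] = "DRAM"
            used += obj["size_mb"]
        else:
            placement[obj["name"]] = "PMEM"
    return placement


objs = [{"name": "graph_edges", "features": [2.0e6, 6.0e6, 0.4, 256.0], "size_mb": 256.0},
        {"name": "vertex_ids",  "features": [4.0e4, 9.0e5, 0.1, 64.0],  "size_mb": 64.0}]
print(place_objects(objs, dram_budget_mb=300.0))
```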
{"title":"Learning to Rank Graph-based Application Objects on Heterogeneous Memories","authors":"Diego Moura, V. Petrucci, D. Mossé","doi":"10.1145/3488423.3519324","DOIUrl":"https://doi.org/10.1145/3488423.3519324","url":null,"abstract":"Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit when compared with DRAM. Its main drawback is that it is typically slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. Soon, PMEM will likely coexist with DRAM in computer systems but the biggest challenge is to know which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application’s performance using Intel Optane DC Persistent Memory. In the first part of our work, we built a tool that automates the profiling and analysis of application objects. In the second part, we build a machine learning model to predict the most critical object within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit compared to using a carefully chosen set of features. By performing data placement using our predictive model, we can reduce the execution time degradation by 12% (average) and 30% (max) when compared to the baseline’s approach based on LLC misses indicator.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131892438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DuoMC: Tight DRAM Latency Bounds with Shared Banks and Near-COTS Performance
Reza Mirosanlou, Mohamed Hassan, R. Pellizzoni
DOI: 10.1145/3488423.3519322
DRAM memory controllers (MCs) in COTS systems are designed primarily for average performance and offer no worst-case guarantees, while real-time MCs provide timing guarantees at the cost of a significant average performance degradation. For this reason, hardware vendors have been reluctant to integrate real-time solutions into high-performance platforms. In this paper, we overcome this performance-predictability trade-off by introducing DuoMC, a novel memory controller that augments a COTS MC with a real-time scheduler and run-time monitoring to provide predictability guarantees. Leveraging the fact that the memory resource is rarely overloaded, DuoMC lets the system enjoy the high performance of the conventional MC most of the time and switches to the real-time scheduler only when timing guarantees risk being violated, which rarely occurs. In addition, unlike most existing real-time MCs, DuoMC enables the use of both private and shared DRAM banks among cores, which facilitates communication among tasks. We evaluate DuoMC using a cycle-accurate multi-core simulator. Results show that DuoMC provides better or comparable latency guarantees than state-of-the-art real-time MCs with limited performance loss (only 8% in the worst case) compared to the COTS MC.
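A toy sketch of the switching policy described above: the controller stays on the high-performance COTS scheduler and falls back to the real-time scheduler only when the monitored latency slack of a pending request becomes too small. The slack metric, bound, and guard band are illustrative assumptions, not DuoMC's actual hardware logic.

```python
class DuoMCMonitor:
    """Toy run-time monitor that decides which scheduler arbitrates DRAM."""

    def __init__(self, worst_case_bound_cycles: int, guard_band_cycles: int):
        self.bound = worst_case_bound_cycles   # per-request latency bound
        self.guard = guard_band_cycles         # safety margin before switching
        self.mode = "COTS"

    def update(self, pending_request_ages):
        """Pick the scheduling mode given the age (in cycles) of pending requests."""
        slack = min((self.bound - age for age in pending_request_ages),
                    default=self.bound)
        # Switch to the real-time scheduler only when a request risks missing
        # its bound; otherwise keep the high-performance path.
        self.mode = "REAL_TIME" if slack <= self.guard else "COTS"
        return self.mode


mon = DuoMCMonitor(worst_case_bound_cycles=500, guard_band_cycles=60)
print(mon.update([120, 310]))   # COTS: plenty of slack
print(mon.update([120, 460]))   # REAL_TIME: a request is close to its bound
```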
{"title":"DuoMC: Tight DRAM Latency Bounds with Shared Banks and Near-COTS Performance","authors":"Reza Mirosanlou, Mohamed Hassan, R. Pellizzoni","doi":"10.1145/3488423.3519322","DOIUrl":"https://doi.org/10.1145/3488423.3519322","url":null,"abstract":"DRAM memory controllers (MCs) in COTS systems are designed primarily for average performance, offering no worst-case guarantees, while real-time MCs provide timing guarantees at the cost of a significant average performance degradation. For this reason, hardware vendors have been reluctant to integrate real-time solutions in high-performance platforms. In this paper, we overcome this performance-predictability trade-off by introducing DuoMC, a novel memory controller that promotes to augment COTS MCs with a real-time scheduler and run-time monitoring to provide predictability guarantees. Leveraging the fact that the resource is barely overloaded, DuoMC allows the system to enjoy the high-performance of the conventional MC most of the time, while switching to the real-time scheduler only when timing guarantees risk being violated, which rarely occurs. In addition, unlike most existing real-time MCs, DuoMC enables the utilization of both private and shared DRAM banks among cores to facilitate communication among tasks. We evaluate DuoMC using a cycle-accurate multi-core simulator. Results show that DuoMC can provide better or comparable latency guarantees than state-of-the-art real-time MCs with limited performance loss (only 8% in the worst scenario) compared to the COTS MC.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114946489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zoned FTL: Achieve Resource Isolation via Hardware Virtualization
Luyi Kang, B. Jacob
DOI: 10.1145/3488423.3519326
NVMe Solid-State Drives (SSDs) offer unprecedented throughput and response time for data centers. To increase resource utilization and enable the necessary isolation, service providers usually accommodate multiple Virtual Machines (VMs) and lightweight containers on the same physical server. To date, providing predictable storage performance remains challenging because commercial datacenter NVMe SSDs still appear as black-box block devices. This motivates us to re-examine the I/O stack and firmware design and to discuss and quantify the root causes of performance interference. We argue that the semantic gap between predictable performance and the underlying device must be bridged to address this challenge. We propose Zoned FTL, a split-level design that enables strong physical isolation for multiple virtualized services with minimal changes to existing storage stacks. We implement a prototype on an SSD emulator and evaluate it in a variety of multi-tenant environments. The evaluation results demonstrate that Zoned FTL barely impacts raw performance while delivering up to 1.51x higher throughput and reducing the 99th-percentile latency by up to 79.4% in a multi-tenant environment.
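A minimal sketch of the isolation idea, assuming a channel-granularity mapping in which each tenant is bound to its own set of flash channels so that tenants cannot interfere with one another; this illustrates the general approach, not the paper's FTL design.

```python
class ZonedMapping:
    """Toy tenant-to-zone mapping giving each tenant private flash resources."""

    def __init__(self, num_channels: int):
        self.free_channels = list(range(num_channels))
        self.tenant_channels = {}          # tenant id -> list of channel ids

    def attach(self, tenant: str, channels_needed: int):
        """Reserve dedicated channels for a tenant (VM or container)."""
        if channels_needed > len(self.free_channels):
            raise RuntimeError("not enough isolated channels available")
        granted = [self.free_channels.pop() for _ in range(channels_needed)]
        self.tenant_channels[tenant] = granted
        return granted

    def route(self, tenant: str, logical_block: int) -> int:
        """Route a tenant's logical block to one of its private channels."""
        own = self.tenant_channels[tenant]
        return own[logical_block % len(own)]


ftl = ZonedMapping(num_channels=8)
ftl.attach("vm-a", 2)
ftl.attach("vm-b", 4)
print(ftl.route("vm-a", 4097))   # always lands on one of vm-a's two channels
```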
{"title":"Zoned FTL: Achieve Resource Isolation via Hardware Virtualization","authors":"Luyi Kang, B. Jacob","doi":"10.1145/3488423.3519326","DOIUrl":"https://doi.org/10.1145/3488423.3519326","url":null,"abstract":"NVMe Solid-State Drives (SSDs) offer unprecedented throughput and response time for data centers. To increase resource utilization and enable necessary isolation, service providers usually accommodate multiple Virtual Machines (VMs) and lightweight containers on the same physical server. Till today, providing predictable storage performance is still challenging as commercial datacenter NVMe SSDs still appear as black-box block devices. This motivates us to re-examine the I/O stack and firmware design, discuss and quantify the root causes of performance interference. We argue that the semantic gap between predictable performance and the underlying device must be bridged to address this challenge. We propose a split-level design, Zoned FTL (), which enables strong physical isolation for multiple virtualized services with minimal changes in existing storage stacks. We implement the prototype on an SSD emulator and evaluate it under a variety of multi-tenant environments. The evaluation results demonstrate that barely impacts the raw performance while delivering up to 1.51x better throughput and reduce the 99th percentile latency by up to 79.4% in a multi-tenancy environment.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122077069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRAM Refresh with Master Wordline Granularity Control of Refresh Intervals: Position Paper
T. Vogelsang, Brent Haukness, E. Linstadt, Torsten Partsch, James Tringali
DOI: 10.1145/3488423.3519321
DRAM cells need to be refreshed periodically to preserve the charge stored in them. Multiple mechanisms cause this charge loss, and they vary in strength. The retention time is therefore not the same for all DRAM cells but follows a distribution spanning multiple orders of magnitude between the cells with the highest and the lowest charge loss. Today's DRAM standards have a single refresh interval that is set based on the retention time of the weakest cells that are not replaced by redundancy or repaired with ECC. Refresh adds overhead to DRAM energy consumption and command bandwidth and blocks access to banks currently being refreshed, and this overhead grows with larger DRAM capacity and shrinking DRAM feature size. In this paper we propose a method and a corresponding circuit implementation that allow different wordlines to be refreshed at different intervals, based on the minimum required retention time of the cells on each wordline. We show that this method can increase the effective refresh interval of a modern DRAM by between 40% and 80% without loss of reliability, with a corresponding reduction in the contribution of refresh to energy consumption and command bandwidth. Our evaluation shows that the method can be implemented with a moderate DRAM die size impact (between 1% and 2.5%). The method does not require the memory controller to keep track of refresh addresses: after initialization of the DRAM devices, the memory controller only needs to issue refresh commands as it does today, albeit fewer than without our approach.
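A small worked example of the benefit: if master wordlines are binned by required retention time and each bin is refreshed at a multiple of the base interval, the total number of refresh operations drops. The bin populations below are made-up numbers, chosen so the result happens to land in the 40% to 80% range the paper reports; they are not measured retention distributions.

```python
# Fraction of master wordlines in each retention bin and the refresh-interval
# multiplier that bin can tolerate (1 = today's base refresh interval).
bins = [
    (0.40, 1),   # weakest wordlines: refresh at the base rate
    (0.35, 2),   # these tolerate twice the base interval
    (0.25, 4),   # these tolerate four times the base interval
]

# Refresh operations per base interval, normalized to refreshing everything at 1x.
relative_refresh_ops = sum(frac / mult for frac, mult in bins)
effective_interval_gain = 1.0 / relative_refresh_ops - 1.0

print(f"refresh operations: {relative_refresh_ops:.2f}x of baseline")        # 0.64x
print(f"effective refresh interval increase: {effective_interval_gain:.0%}") # ~57%
```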
{"title":"DRAM Refresh with Master Wordline Granularity Control of Refresh Intervals: Position Paper","authors":"T. Vogelsang, Brent Haukness, E. Linstadt, Torsten Partsch, James Tringali","doi":"10.1145/3488423.3519321","DOIUrl":"https://doi.org/10.1145/3488423.3519321","url":null,"abstract":"DRAM cells need to be periodically refreshed to preserve the charge stored in them. There are multiple mechanisms causing the loss of charge. These mechanisms also vary in strength. The retention time is therefore not the same for all DRAM cells but follows a distribution with multiple orders of magnitude difference between the retention time of cells with the highest charge loss and the cells with the lowest charge loss. Today's DRAM standards have one refresh interval that is set based on the retention time of the weakest cells that are not replaced by redundancy or repaired with ECC. Refresh adds overhead to DRAM energy consumption and command bandwidth and blocks access to banks currently being refreshed. This overhead increases with larger DRAM capacity and shrinking DRAM feature size. In this paper we propose a method and corresponding circuit implementation that allows using different refresh intervals based on the required minimum retention time of cells on those wordlines. We show that this method can increase the effective refresh interval of a modern DRAM between 40% and 80% without loss of reliability and a corresponding reduction of the contribution of refresh to energy consumption and command bandwidth. Our evaluation shows that the method can be implemented with a moderate DRAM die size impact (between 1% and 2.5%). The method does not require the memory controller to keep track of refresh addresses. After initialization of the DRAM devices, the memory controller needs only to issue refresh commands as today, albeit a smaller number than without our approach.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130729934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward Computer Vision-based Machine Intelligent Hybrid Memory Management
Thaleia Dimitra Doudali, Ada Gavrilovska
DOI: 10.1145/3488423.3519325
Current state-of-the-art systems for hybrid memory management are enriched with machine intelligence. To enable the practical use of Machine Learning (ML), system-level page schedulers focus ML model training on a small subset of the applications' memory footprint, while using existing lightweight historical information to predict the access behavior of the majority of the pages. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods, which involve the calculation of detailed performance estimates that depend on the configuration of the hybrid memory platform. This paper explores opportunities to reduce such operational overheads of machine learning-based hybrid memory page schedulers by using visualization techniques to depict memory access patterns and reveal spatial and temporal correlations among the selected pages that current methods fail to leverage. We propose an initial version of a visualization pipeline for prioritizing pages for machine learning that is independent of the hybrid memory configuration. Our approach selects pages whose ML-based management delivers, on average, performance within 5% of current solutions, while reducing page selection time by 75x. We discuss future directions and make the case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.
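A minimal sketch of the visualization step described, assuming a simple binning of a (timestamp, page) access trace into an image-like pages-by-epochs heatmap in which spatially and temporally correlated pages appear as visible structure; the binning and log scaling are illustrative choices, not the paper's pipeline.

```python
import numpy as np

def accesses_to_image(trace, num_pages, num_epochs, epoch_len):
    """Build a (pages x epochs) heatmap from a (timestamp, page) access trace."""
    img = np.zeros((num_pages, num_epochs), dtype=np.float32)
    for t, page in trace:
        epoch = min(int(t // epoch_len), num_epochs - 1)
        img[page, epoch] += 1.0
    # Log-scale so hot and cold pages are both visible in the same image.
    return np.log1p(img)


# Tiny example: a strided pattern over 8 pages, split into 4 time epochs.
trace = [(t, (3 * t) % 8) for t in range(400)]
image = accesses_to_image(trace, num_pages=8, num_epochs=4, epoch_len=100)
print(image.shape)    # (8, 4): one row per page, one column per time epoch
```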
{"title":"Toward Computer Vision-based Machine Intelligent Hybrid Memory Management","authors":"Thaleia Dimitra Doudali, Ada Gavrilovska","doi":"10.1145/3488423.3519325","DOIUrl":"https://doi.org/10.1145/3488423.3519325","url":null,"abstract":"Current state-of-the-art systems for hybrid memory management are enriched with machine intelligence. To enable the practical use of Machine Learning (ML), system-level page schedulers focus the ML model training over a small subset of the applications’ memory footprint. At the same time, they use existing lightweight historical information to predict the access behavior of majority of the pages. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods. These methods involve the calculation of detailed performance estimates depending on the configuration of the hybrid memory platform. This paper explores the opportunities to reduce such operational overheads of machine learning-based hybrid memory page schedulers via use of visualization techniques to depict memory access patterns, and reveal spatial and temporal correlations among the selected pages, that current methods fail to leverage. We propose an initial version of a visualization pipeline for prioritizing pages for machine learning, that is independent of the hybrid memory configuration. Our approach selects pages whose ML-based management delivers, on average, performance levels within 5% of current solutions, while reducing by 75 × the page selection time. We discuss future directions and make a case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"50 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124528522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SoftRefresh: Targeted Refresh for Energy-efficient DRAM Systems via Software and Operating Systems Support
Duy-Thanh Nguyen, Nhut-Minh Ho, I. Chang
DOI: 10.1145/3488423.3519323
Due to their capacitive nature, DRAM cells must be refreshed regularly to retain their information. However, given the scale of DRAM deployment in modern computer systems, the energy overhead of DRAM refresh operations is becoming significant. The crux of managing DRAM refresh is knowing whether the data in particular cells are valid or not. Previous works have suggested many hardware schemes that effectively try to guess this. In this paper, we propose modifications that allow software involvement in regulating refresh operations, which opens the door for targeted, and hence minimal, refresh: only valid pages with potential bit errors are refreshed. Compared to conventionally refreshing the whole DRAM, our SoftRefresh saves up to 43% energy on average. Our proposal works on all types of modern DRAM with only minor modifications to existing hardware and software systems.
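A toy model of the software/hardware contract implied here, assuming the OS exposes a bitmap of DRAM rows that hold valid data and the refresh logic skips rows whose bit is clear; the row-bitmap interface is an assumption for illustration, not the paper's exact mechanism.

```python
class ValidRowBitmap:
    """Toy model of software-directed, targeted DRAM refresh."""

    def __init__(self, num_rows: int):
        self.valid = bytearray(num_rows)     # 1 = row holds live page data

    def mark(self, row: int, is_valid: bool):
        self.valid[row] = 1 if is_valid else 0

    def rows_to_refresh(self):
        """Hardware-side view: only valid rows receive refresh commands."""
        return [r for r, v in enumerate(self.valid) if v]


bmp = ValidRowBitmap(num_rows=16)
for r in (0, 1, 5, 9):                  # OS maps live pages onto these rows
    bmp.mark(r, True)
saved = 1 - len(bmp.rows_to_refresh()) / 16
print(f"refresh operations skipped: {saved:.0%}")   # 75% of rows skipped
```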
{"title":"SoftRefresh: Targeted refresh for Energy-efficient DRAM systems via Software and Operating Systems support","authors":"Duy-Thanh Nguyen, Nhut-Minh Ho, I. Chang","doi":"10.1145/3488423.3519323","DOIUrl":"https://doi.org/10.1145/3488423.3519323","url":null,"abstract":"Due to its capacitive nature, DRAM cells must be refreshed regularly to retain their information. However, due to the scale of DRAM deployment in modern computer systems, the energy overhead of DRAM refresh operations is becoming significant. The crux in managing DRAM refresh is knowing if the data in particular cells are valid or not. Previous works have suggested many hardware schemes that effectively try to guess this. In this paper, we propose modifications to allow software involvement in regulating refresh operations. This opens the door for targeted, and hence minimal, refresh operations. Only valid pages having potential bit errors will be refreshed. Compared to conventionally refreshing the whole DRAM, our SoftRefresh saves up to 43% energy on average. Our proposal can work on all types of modern DRAM with only minor modifications to the existing hardware and software systems.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127584741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAPCP: Memory Access Pattern Classifying Prefetcher
Manaal Mukhtar Jamadar, Jaspinder Kaur, Shirshendu Das
DOI: 10.1145/3488423.3519328
Prefetching is a technique used to improve system performance by bringing data or instructions into the cache before they are demanded by the core. Several prefetching techniques have been proposed, in both hardware and software, to predict the data to be prefetched with high accuracy and coverage. The memory access patterns of applications can be classified as either regular or irregular. Most prefetchers exclusively target one of these pattern types by learning from either the temporal or the spatial correlation among past data accesses. Our proposal focuses on covering all kinds of access patterns that can be predicted by either a temporal or a spatial prefetcher. Running both kinds of prefetchers in parallel is not a wise design, as it leads to unnecessary hardware (storage) overhead for the temporal prefetcher's metadata. We propose broadly classifying an application's memory access patterns on the fly as regular or irregular, and then using the appropriate prefetcher to issue prefetches for each class. This reduces the metadata requirement of the temporal prefetcher by 75%. Evaluation of our proposed solution on the SPEC CPU 2006 benchmarks achieves a speedup of 23.7% over the no-prefetching baseline, which is a 4% improvement over the state-of-the-art spatial prefetcher BIP and a 13.2% improvement over the temporal prefetcher Triage.
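A minimal sketch of the on-the-fly classification step, assuming a short address history is checked for a dominant stride and the stream is labeled regular (handed to a spatial prefetcher) or irregular (handed to a temporal prefetcher); the history length and threshold are illustrative, not MAPCP's actual heuristics.

```python
from collections import Counter, deque

class PatternClassifier:
    """Classify a memory access stream as 'regular' or 'irregular'."""

    def __init__(self, history: int = 16, regular_threshold: float = 0.6):
        self.addrs = deque(maxlen=history)
        self.threshold = regular_threshold

    def observe(self, block_addr: int) -> str:
        self.addrs.append(block_addr)
        if len(self.addrs) < 4:
            return "irregular"            # not enough history yet
        addrs = list(self.addrs)
        deltas = [b - a for a, b in zip(addrs, addrs[1:])]
        # If one stride dominates the recent deltas, the stream is regular.
        _, top_count = Counter(deltas).most_common(1)[0]
        return "regular" if top_count / len(deltas) >= self.threshold else "irregular"


clf = PatternClassifier()
for addr in range(0, 64, 4):              # perfectly strided stream
    label = clf.observe(addr)
print(label)                               # -> regular
```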
{"title":"MAPCP: Memory Access Pattern Classifying Prefetcher","authors":"Manaal Mukhtar Jamadar, Jaspinder Kaur, Shirshendu Das","doi":"10.1145/3488423.3519328","DOIUrl":"https://doi.org/10.1145/3488423.3519328","url":null,"abstract":"Prefetching is a technique used to improve system performance by bringing data or instructions in the cache before it is demanded by the core. Several prefetching techniques have been proposed, in both hardware and software, to predict the data to be prefetched with high accuracy and coverage. The memory patterns accessed by applications can be classified as either regular memory access patterns or irregular memory access patterns. Most prefetchers exclusively target either of these patterns by learning from either temporal or spatial correlation among the past data accesses observed. Our proposal focuses on covering all kinds of access patterns which can be predicted by a temporal as well as a spatial prefetcher. Running both kinds of prefetchers in parallel is not a wise design as it leads to unnecessary hardware (storage) overhead for metadata storage of temporal prefetcher. We propose broadly classifying the memory access patterns of applications on the go as regular or irregular, and then using an appropriate prefetcher to issue prefetches for the respective classes. This reduces the metadata requirement in case of temporal prefetcher by 75%. Evaluation of our proposed solution on SPEC CPU 2006 benchmarks achieve a speedup of 23.7% over the no-prefetching baseline, which is a 4% improvement over the state of the art spacial prefetcher BIP, and 13.2% improvement over the temporal prefetcher, Triage.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Latency Modulation Scheme for Solid State Drive
A. Berman
DOI: 10.1145/3488423.3519334
Due to density and performance considerations, the wordline of current Flash memories holds far more data than the legacy file-system sector size. A read operation obtains a page whose size equals the wordline length. However, typical workloads issue random reads with a standard sector size of 4KB. The question therefore arises whether the fact that the read sector is shorter than the page can be exploited to reduce the number of data sensing operations. In other words, can a data sector be extracted from a wordline using fewer sensing operations than are required to read the whole page? In this paper, we develop a data modulation scheme for low-latency random sector reads, referred to as Sector Packing. Our technique reduces latency in multi-bit-per-cell architectures and can also improve device throughput. For example, in QLC with 16KB pages, latency is reduced by 34%. Two implementation architectures are offered. The first increases channel data traffic and requires no changes to the NAND, only to the controller. The second adds a small hardware overhead inside the NAND, resulting in reduced data transmission over the SSD channel. Sector Packing is scalable, as the gain grows with the number of bits per cell.
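A back-of-the-envelope sketch of why packing a sector into fewer sensing operations reduces latency, assuming a toy read-latency model of a fixed overhead plus a per-sensing-step cost; the sensing counts and timings below are made-up numbers, not the paper's encoding or measurements.

```python
def read_latency_us(num_senses: int, sense_us: float = 10.0, overhead_us: float = 15.0):
    """Toy NAND read latency model: fixed overhead plus a cost per sensing step."""
    return overhead_us + num_senses * sense_us

# Illustrative assumption: the conventional 16KB page read needs 4 sensing
# steps to resolve the requested sector, while a packed 4KB sector needs 2.
baseline = read_latency_us(num_senses=4)
packed   = read_latency_us(num_senses=2)
print(f"latency reduction: {1 - packed / baseline:.0%}")   # ~36% with these numbers
```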
{"title":"Low-Latency Modulation Scheme for Solid State Drive","authors":"A. Berman","doi":"10.1145/3488423.3519334","DOIUrl":"https://doi.org/10.1145/3488423.3519334","url":null,"abstract":"Due to density and performance considerations, the current Flash memory wordline contains more cells than the legacy file-system sector size. Read operation can obtain a page size that is equal to wordline length. However, the typical workload has random read instructions with a standard sector size of 4KB. Therefore, the question arises whether the fact that the read sector is shorter than the page size can be used to reduce the number of data sensing operations. In other words, can a data sector be extracted from a wordline by using fewer sensing operations required to read the whole page? In this paper, we develop a data modulation scheme for low-latency random sector read, referred to as Sector Packing. Our technique reduces latency in multiple bits per cell architecture and can also improve device throughput. For example, in QLC with 16KB pages, latency is reduced by 34%. Two implementation architectures are offered. The first increases channel data traffic and do not require any NAND changes but only in the controller. The second architecture requires adding a small hardware overhead inside the NAND, resulting in reduced data transmission over the SSD channel. Sector Packing is scalable as the gain is higher with more bits per cell.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"132 Pt 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124003340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Writeback Modeling: Theory and Application to Zipfian Workloads
Wesley Smith, Daniel Byrne, C. Ding
DOI: 10.1145/3488423.3519331
As per-core CPU performance plateaus and data-bound applications like graph analytics and key-value stores become more prevalent, understanding memory performance is more important than ever. Many existing techniques for predicting and measuring cache performance on a given workload involve either static analysis or tracing, but programs like key-value stores can easily have billions of memory accesses in a trace, with access patterns driven by non-statically observable phenomena such as user behavior. Past analytical solutions focus on modeling cache hits, but the rise of non-volatile memory (NVM) such as Intel's Optane, with asymmetric read/write latencies, bandwidths, and power consumption, means that writes and writebacks are now critical performance considerations as well, especially in the context of large-scale software caches. We introduce two novel analytical cache writeback models that handle workloads with general frequency distributions; in addition, we provide closed-form instantiations for Zipfian workloads, one of the most ubiquitous frequency distribution types in data-bound applications. The models have different use cases and asymptotic runtimes, making them suited to different circumstances, but both are fully analytical: cache writeback statistics are computed with no tracing or sampling required. We demonstrate that these models are extremely accurate and fast. The first model, for an infinitely large level-two (L2) software cache, averages 5.0% relative error from ground truth and achieves a minimum speedup of 515x over a state-of-the-art trace analysis technique (AET) when generating writeback information for a single cache size. The second model, which is fully general with respect to L1 and L2 sizes but slower, averages 3.0% relative error from ground truth and achieves a minimum speedup of 105x over AET for a single cache size.
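As a rough illustration of what a trace-free writeback estimate for a Zipfian workload can look like, the sketch below uses Che's characteristic-time approximation for an LRU cache and assumes every access writes its object, so every eviction produces a writeback; this is a well-known simplification chosen for illustration, not either of the paper's models.

```python
import numpy as np
from scipy.optimize import brentq

def zipf_writeback_rate(num_objects: int, cache_size: int, alpha: float) -> float:
    """Writebacks per access for an LRU cache under a Zipf(alpha) workload,
    assuming every access writes its object (so every eviction is a writeback)."""
    ranks = np.arange(1, num_objects + 1)
    p = ranks ** -alpha
    p /= p.sum()                                   # Zipfian access probabilities

    # Characteristic time T: expected cache occupancy equals the cache size.
    occupancy = lambda T: np.sum(1.0 - np.exp(-p * T)) - cache_size
    T = brentq(occupancy, 1e-9, 1e12)

    hit_ratio = np.sum(p * (1.0 - np.exp(-p * T)))
    return 1.0 - hit_ratio                         # misses == evictions == writebacks


# Example: 100k objects, a cache holding 10% of them, Zipf exponent 0.9.
print(zipf_writeback_rate(num_objects=100_000, cache_size=10_000, alpha=0.9))
```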
{"title":"Writeback Modeling: Theory and Application to Zipfian Workloads","authors":"Wesley Smith, Daniel Byrne, C. Ding","doi":"10.1145/3488423.3519331","DOIUrl":"https://doi.org/10.1145/3488423.3519331","url":null,"abstract":"As per-core CPU performance plateaus and data-bound applications like graph analytics and key-value stores become more prevalent, understanding memory performance is more important than ever. Many existing techniques to predict and measure cache performance on a given workload involve either static analysis or tracing, but programs like key-value stores can easily have billions of memory accesses in a trace and have access patterns driven by non-statically observable phenomena such as user behavior. Past analytical solutions focus on modeling cache hits, but the rise of non-volatile memory (NVM) like Intel’s Optane with asymmetric read/write latencies, bandwidths, and power consumption means that writes and writebacks are now critical performance considerations as well, especially in the context of large-scale software caches. We introduce two novel analytical cache writeback models that function for workloads with general frequency distributions; in addition we provide closed-form instantiations for Zipfian workloads, one of the most ubiquitous frequency distribution types in data-bound applications. The models have different use cases and asymptotic runtimes, making them suited for use in different circumstances, but both are fully analytical; cache writeback statistics are computed with no tracing or sampling required. We demonstrate that these models are extremely accurate and fast: the first model, for infinitely large level-two (L2) software cache, averaged 5.0% relative error from ground truth and achieved a minimum speedup over a state-of-the-art trace analysis technique (AET) of 515x to generate writeback information for a single cache size. The second model, which is fully general with respect to L1 and L2 sizes but slower, averaged 3.0% relative error from ground truth and achieved a minimum speedup over AET of 105x for a single cache size.","PeriodicalId":355696,"journal":{"name":"Proceedings of the International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128648636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}