
Proceedings of the 2015 International Symposium on Memory Systems: Latest Publications

Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818952
S. M. Hassan, S. Yalamanchili, S. Mukhopadhyay
A promising recent development that can provide continued performance scaling is the ability to stack multiple DRAM layers on a multi-core processor die. This paper analyzes the interaction between the interconnection network and the memory hierarchy in such systems, and its impact on system performance. We explore the design considerations of a 3D system with DRAM-on-processor stacking and note that the full advantages of 3D can only be achieved by configuring the memory with a high number of channels. This significantly increases memory-level parallelism, which decreases the traffic per DRAM bank and reduces queuing delays, but increases traffic on the interconnection network, making remote accesses expensive. To reduce latency and traffic on the network, we propose restructuring the memory hierarchy into a memory-side cache organization, and we also explore the effects of various address translation and OS page allocation strategies. Our results indicate that a carefully designed 3D memory system can already improve performance by 25-35% without resorting to new, sophisticated techniques.
Citations: 33
Opportunities to Upgrade Main Memory
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818960
D. Resnick
Hybrid Memory Cube (HMC), in production by Micron Technology, is a new DRAM component with multiple advantages over current parts, including higher bandwidth, lower energy, and a more abstract and pin-efficient interface. The memory technology can be used as a base for even further improvements, including scaling memory capacity to multiple terabytes with terabyte-per-second bandwidth per processor, and resilience such that even large supercomputers with hundreds of petabytes of memory will have reliable memory systems. Future systems, from desktops up, will have memory systems of multiple levels, including DRAM and non-volatile (NAND?) components that both serve as first-level memory, along with DRAM or SRAM scratch memory, such that total data motion is greatly reduced. The result can be improved system performance and reduced system power.
Citations: 3
Writing without Disturb on Phase Change Memories by Integrating Coding and Layout Design
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818962
A. Eslami, Alfredo J. Velasco, Alireza Vahid, Georgios Mappouras, A. Calderbank, Daniel J. Sorin
We integrate coding techniques and layout design to eliminate write-disturb in phase change memories (PCMs) while enhancing lifetime and host-visible capacity. We first propose a checkerboard configuration for the cell layout that eliminates write-disturb while doubling the memory lifetime. We then introduce two methods to jointly design Write-Once-Memory (WOM) codes and layout. The first WOM-layout design more than doubles the lifetime without compromising the host-visible capacity. The second design applies WOM codes to even denser layouts to achieve both lifetime and capacity gains. These constructions demonstrate that substantial improvements to lifetime and host-visible capacity are possible by co-designing coding and cell layout in PCM.
Citations: 10
Bringing Modern Hierarchical Memory Systems Into Focus: A study of architecture and workload factors on system performance
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818975
Paul Tschirhart, Jim Stevens, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob
The increasing size of workloads has led to the development of new technologies and architectures intended to address the capacity limitations of DRAM main memories. The proposed solutions fall into two categories: those that re-engineer Flash-based SSDs to further improve storage system performance, and those that incorporate non-volatile technology into a hybrid main memory system. These developments have blurred the line between the storage and memory systems. In this paper, we examine the differences between these two approaches to gain insight into the types of applications and memory technologies that benefit the most from each architectural approach. In particular, this work uses full-system simulation to examine the impact of workload randomness on system performance, the impact of backing-store latency on system performance, and how the different implementations utilize system resources differently. We find that the software overhead incurred by storage-based implementations can account for almost 50% of the overall access latency. As a result, backing-store technologies with an access latency of up to 25 microseconds tend to perform better when implemented as part of the main memory system. We also see that high degrees of random access can exacerbate the software overhead problem and lead to large performance advantages for the hybrid main memory approach. Meanwhile, the page replacement algorithm used by the OS in the storage approach yields considerably better performance on highly sequential workloads, at the cost of greater pressure on the cache.
Citations: 3
Opportunities and Challenges of Performing Vector Operations inside the DRAM
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818953
M. Alves, P. C. Santos, M. Diener, L. Carro
In order to overcome the low memory bandwidth and the high energy costs associated with data transfer between the processor and main memory, proposals for near-data computing have started to gain acceptance in systems ranging from embedded architectures to high-performance computing. Most previous approaches propose application-specific hardware or require a large amount of logic. Moreover, most proposals require algorithm changes and do not make use of the full parallelism available in DRAM devices. These issues limit the adoption and the performance of near-data computing. In this paper, we propose implementing vector instructions directly inside the DRAM devices, which we call the Memory Vector Extensions (MVX). This balanced approach reduces data movement between the DRAM and the processor while requiring little hardware to achieve good performance. Compared to current vector operations on processors, our proposal enables performance gains of up to 97x and reduces full-system energy consumption by up to 70x.
Citations: 11
Achieving Yield, Density and Performance Effective DRAM at Extreme Technology Sizes
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818963
B. Childers, Jun Yang, Youtao Zhang
For over forty years, DRAM has been the most compelling choice for main memory. It is a well-understood commodity technology that strikes an ideal balance between cost, performance, capacity, and energy. Yet, as DRAM scales to the extremes of deep submicron technology, it faces a critical challenge from the impact of process variation (PV) on chip yield: PV in the transistor and capacitor used to hold a bit of information, along with other components, can cause critical requirements to be violated, including retention capability, cell reliability, and operational timing. The challenges of retention and reliability are well known. The latter challenge, however, has received significantly less attention: the impact on DRAM yield of operational timing violations due to PV. This challenge stands equal to the others in achieving sufficient yield for continued commodity production of DRAM. In this paper, we argue that timing requirements must be relaxed and exposed on a per-location basis for management by the memory sub-system architecture, to overcome the challenge to yield from timing. This "soft yield" approach trades exposed timing variability for enhanced yield, without harming chip density. Because relaxing and exposing variable timing can lead to application performance loss, a suite of techniques must be developed by the architecture community to mitigate the loss. We raise awareness of this problem and suggest directions where solutions may be found.
Citations: 7
Energy Efficient Scale-In Clusters with In-Storage Processing for Big-Data Analytics
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818983
I. Choi, Yang-Suk Kee
Big data drives a computing paradigm shift. Due to enormous data volumes, data-intensive programming frameworks are pervasive and scale-out clusters are widespread. As a result, data-movement energy dominates overall energy consumption, and this will get worse with technology scaling. We propose scale-in clusters with In-Storage Processing (ISP) devices to enable energy-efficient computing for big-data analytics. ISP devices eliminate or reduce data movement toward CPUs and execute tasks more energy-efficiently. With energy-efficient computing near the data and higher throughput, clusters with ISP can achieve more than four times the energy efficiency of similarly performing scale-out clusters, using fewer nodes.
Citations: 14
Herniated Hash Tables: Exploiting Multi-Level Phase Change Memory for In-Place Data Expansion
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818981
Zhaoxia Deng, Lunkai Zhang, D. Franklin, F. Chong
Hash tables are a common data structure used in many algorithms and applications. As applications and data scale, the efficient implementation of hash tables becomes increasingly important and challenging. In particular, memory capacity becomes increasingly important, and entries can become asymmetrically chained across hash buckets. This chaining prevents two forms of parallelism: memory-level parallelism (allowing multiple prefetch requests to overlap) and memory-computation parallelism (allowing computation to overlap memory operations). We propose herniated hash tables, a technique that exploits multi-level phase change memory (PCM) storage to expand storage at each hash bucket and increase parallelism without increasing physical space. The technique works by increasing the number of bits stored within the same resistance range of an individual PCM cell. We pack more data into the same cells by decreasing noise margins, and we pay for this higher density with higher-latency reads and writes that resolve the more precise resistance values. Furthermore, our organization, coupled with an addressing and prefetching scheme, increases the memory parallelism of the herniated data structure. We simulate our system with a variety of hash table applications and evaluate the density and performance benefits against a number of baseline systems. Compared with conventional chained hash tables on single-level PCM, herniated hash tables achieve 4.8x density on a 4-level PCM while achieving up to 67% performance improvement.
Citations: 8
SIMT-based Logic Layers for Stacked DRAM Architectures: A Prototype
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818954
C. Kersey, S. Yalamanchili, Hyesoon Kim
Stacked DRAM products are now available, and the likelihood of future products combining DRAM stacks with custom logic layers seems high. The near-memory processor in such a system will have to be energy-efficient, latency-tolerant, and capable of exploiting both high memory-level parallelism and high memory bandwidth. We believe that single-instruction-multiple-thread (SIMT) processors are uniquely suited to this task, and to evaluate this claim we have produced an FPGA-based prototype.
Citations: 3
Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads
Pub Date : 2015-10-05 DOI: 10.1145/2818950.2818980
A. Laer, William Wang, C. D. Emmons
With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors gets deeper and the caches get bigger. The cacheline size, however, has remained relatively constant over the last decade in mobile application processors. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh by looking at two inefficiencies in the cache hierarchy that tend to be exacerbated when the cacheline size increases: false sharing and cacheline utilization. Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures, mapped onto the same cacheline, are accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places, from scientific workloads to real applications. We find that while increasing the cacheline size does increase false sharing, it remains negligible compared to known cases of false sharing in scientific workloads, due to the limited thread-level parallelism of mobile workloads. Secondly, we look at cacheline utilization, which measures the number of bytes in a cacheline actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. Low cacheline utilization implies that very little of each fetched cacheline is used by the processor, wasting bandwidth and energy in moving data across the memory hierarchy. The energy cost of data movement is much higher than that of logic operations, increasing the need for cache efficiency, especially on an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general and decreases as the cacheline size increases. When increasing the cacheline size from 64 bytes to 128 bytes, the number of misses is reduced by 10%--30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches. Using cacheline utilization as a metric in this way illustrates an important point: if a change in cacheline size were assessed only by its local effects, it would appear to have only advantages, as the miss rate decreases. At the system level, however, the change increases the stress on the bus and the amount of energy wasted on unused traffic. Using cacheline utilization as a metric underscores the need for system-level research when changing characteristics of the cache hierarchy.
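The study's two metrics, miss count and cacheline utilization, can be computed from an address trace in a few lines; the random trace below is a stand-in for a sparse mobile workload, not data from the paper:

```python
# For an infinite cache, misses = distinct lines touched, and utilization =
# bytes actually accessed / bytes fetched. 4-byte accesses assumed.
import random

def analyse(trace, line_size, access_bytes=4):
    lines = {}                                    # line base -> used offsets
    for addr in trace:
        base = addr - addr % line_size
        lines.setdefault(base, set()).update(
            range(addr - base, addr - base + access_bytes))
    misses = len(lines)
    used = sum(len(offs) for offs in lines.values())
    return misses, used / (misses * line_size)

random.seed(1)
trace = [random.randrange(0, 1 << 16) & ~3 for _ in range(512)]  # sparse trace
for size in (64, 128):
    misses, util = analyse(trace, size)
    print(f"{size:3d} B lines: {misses:3d} misses, {util:5.1%} utilized")
```

On a sparse trace, doubling the line size cuts misses but fetches even more bytes that are never touched, which is exactly the system-level waste the abstract highlights.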
Citations: 2