Junwhan Ahn, Sungpack Hong, S. Yoo, O. Mutlu, Kiyoung Choi
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
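As a rough illustration of the message-passing style this abstract describes, the sketch below models memory partitions in software and forwards per-vertex updates as small function calls to the partition that owns the destination vertex. The names put, barrier, and Partition are invented for this toy and are not Tesseract's actual programming interface.

```cpp
// Toy model of a message-passing vertex program in the spirit of the abstract.
// All names (put, barrier, Partition) are hypothetical; in Tesseract, such
// updates become remote function calls between in-memory compute partitions.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

struct Partition {
    std::vector<double> pagerank_next;        // per-vertex partial sums owned here
    std::queue<std::function<void()>> inbox;  // pending remote function calls
};

// "put": enqueue a small function call on the partition that owns dst,
// instead of issuing a remote read-modify-write from the caller.
void put(Partition& owner, uint32_t dst, double contribution) {
    owner.inbox.push([&owner, dst, contribution] {
        owner.pagerank_next[dst] += contribution;  // executed locally, near the data
    });
}

// "barrier": drain all inboxes so every enqueued update is applied.
void barrier(std::vector<Partition>& parts) {
    for (auto& p : parts)
        while (!p.inbox.empty()) { p.inbox.front()(); p.inbox.pop(); }
}

int main() {
    std::vector<Partition> parts(2);
    for (auto& p : parts) p.pagerank_next.assign(4, 0.0);

    // A vertex owned by partition 0 sends contributions to neighbors on partition 1.
    put(parts[1], 2, 0.25);
    put(parts[1], 3, 0.25);
    barrier(parts);

    std::printf("v2=%.2f v3=%.2f\n", parts[1].pagerank_next[2], parts[1].pagerank_next[3]);
    return 0;
}
```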
{"title":"A scalable processing-in-memory accelerator for parallel graph processing","authors":"Junwhan Ahn, Sungpack Hong, S. Yoo, O. Mutlu, Kiyoung Choi","doi":"10.1145/2749469.2750386","DOIUrl":"https://doi.org/10.1145/2749469.2750386","url":null,"abstract":"The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"29 1","pages":"105-117"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80303922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Li, Yang Hu, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, Tao Li
Recent years have seen an explosion of data volumes from a myriad of distributed sources such as ubiquitous cameras and various sensors. The challenges of analyzing these geographically dispersed datasets are increasing due to the significant data movement overhead, time-consuming data aggregation, and escalating energy needs. Rather than constantly moving a tremendous amount of raw data to remote warehouse-scale computing systems for processing, it would be beneficial to leverage in-situ server systems (InS) to pre-process data, i.e., bringing computation to where the data is located. This paper takes the first step towards designing server clusters for data processing in the field. We investigate two representative in-situ computing applications, where data is normally generated from environmentally sensitive areas or remote places that lack established utility infrastructure. These very special operating environments of in-situ servers urge us to explore standalone (i.e., off-grid) systems that offer the opportunity to benefit from local, self-generated energy sources. In this work we implement a heavily instrumented proof-of-concept prototype called InSURE: in-situ server systems using renewable energy. We develop a novel energy buffering mechanism and a unique joint spatio-temporal power management strategy to coordinate standalone power supplies and in-situ servers. We present detailed deployment experiences to quantify how our design fits with in-situ processing in the real world. Overall, InSURE yields 20%-60% improvements over a state-of-the-art baseline. It maintains impressive control effectiveness in under-provisioned environments and can scale economically with data processing needs. The proposed design complements today's grid-connected cloud data centers well and provides competitive cost-effectiveness.
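To make the coordination idea concrete, here is a toy supply-following loop with invented names and numbers, not InSURE's actual controller: in each control epoch, surplus renewable power charges the energy buffer, and deficits are covered by discharging the buffer or capping server power.

```cpp
// Toy supply-following decision (illustrative only): split renewable power
// between running the in-situ servers and charging the energy buffer, and let
// the buffer backfill deficits before capping server power.
#include <algorithm>
#include <cstdio>

struct Epoch { double supply_w; double demand_w; };

int main() {
    double buffer_wh = 50.0;              // stored energy
    const double buffer_cap_wh = 200.0;
    const double epoch_h = 0.25;          // 15-minute control epochs

    Epoch trace[] = {{300, 200}, {120, 220}, {400, 180}, {60, 210}};
    for (const Epoch& e : trace) {
        double surplus = e.supply_w - e.demand_w;
        if (surplus >= 0) {
            buffer_wh = std::min(buffer_cap_wh, buffer_wh + surplus * epoch_h);
            std::printf("run at full demand, charge %.0f Wh (buffer=%.0f Wh)\n",
                        surplus * epoch_h, buffer_wh);
        } else {
            double needed_wh = -surplus * epoch_h;
            double drawn = std::min(buffer_wh, needed_wh);
            buffer_wh -= drawn;
            double cap_w = e.supply_w + drawn / epoch_h;  // power cap if buffer runs short
            std::printf("discharge %.0f Wh, cap servers at %.0f W (buffer=%.0f Wh)\n",
                        drawn, cap_w, buffer_wh);
        }
    }
    return 0;
}
```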
{"title":"Towards sustainable in-situ server systems in the big data era","authors":"Chao Li, Yang Hu, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, Tao Li","doi":"10.1145/2749469.2750381","DOIUrl":"https://doi.org/10.1145/2749469.2750381","url":null,"abstract":"Recent years have seen an explosion of data volumes from a myriad of distributed sources such as ubiquitous cameras and various sensors. The challenges of analyzing these geographically dispersed datasets are increasing due to the significant data movement overhead, time-consuming data aggregation, and escalating energy needs. Rather than constantly move a tremendous amount of raw data to remote warehouse-scale computing systems for processing, it would be beneficial to leverage in-situ server systems (InS) to pre-process data, i.e., bringing computation to where the data is located. This paper takes the first step towards designing server clusters for data processing in the field. We investigate two representative in-situ computing applications, where data is normally generated from environmentally sensitive areas or remote places that lack established utility infrastructure. These very special operating environments of in-situ servers urge us to explore standalone (i.e., off-grid) systems that offer the opportunity to benefit from local, self-generated energy sources. In this work we implement a heavily instrumented proof-of-concept prototype called InSURE: in-situ server systems using renewable energy. We develop a novel energy buffering mechanism and a unique joint spatio-temporal power management strategy to coordinate standalone power supplies and in-situ servers. We present detailed deployment experiences to quantify how our design fits with in-situ processing in the real world. Overall, InSURE yields 20%~60% improvements over a state-of-the-art baseline. It maintains impressive control effectiveness in under-provisioned environment and can economically scale along with the data processing needs. The proposed design is well complementary to today's grid-connected cloud data centers and provides competitive cost-effectiveness.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"112 1","pages":"14-26"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79340326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate data in memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse, and round-trip data traversals throughout the memory hierarchy. This paper presents a two-pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.
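The cost the abstract refers to is easy to see in a plain out-of-place transpose, sketched below: one side of the copy is unit-stride while the other strides by a full row, so consecutive accesses land in different cache lines and DRAM rows. This is a generic illustration of the access pattern, not the paper's accelerator.

```cpp
// Naive out-of-place transpose: reads are unit-stride, writes stride by N
// elements -- the kind of reorganization pattern a near-memory reshape unit
// is meant to absorb instead of the cache hierarchy.
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<float> a(N * N), b(N * N);
    for (int i = 0; i < N * N; ++i) a[i] = static_cast<float>(i);

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            b[j * N + i] = a[i * N + j];   // read stride 1, write stride N

    std::printf("b[1]=%f (was a[%d])\n", b[1], N);
    return 0;
}
```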
{"title":"Data reorganization in memory using 3D-stacked DRAM","authors":"Berkin Akin, F. Franchetti, J. Hoe","doi":"10.1145/2749469.2750397","DOIUrl":"https://doi.org/10.1145/2749469.2750397","url":null,"abstract":"In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"78 1","pages":"131-143"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78513816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Soft error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates, and performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce further the false-positive rate; (3) a light-weight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%).
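The sketch below illustrates the general value-locality idea behind a PBFS-style bit-mask filter, using an encoding chosen for illustration rather than taken from the paper: high-order bits that stay stable across recent results are learned, and a new result whose learned-stable bits change raises a screening hint. Note how early samples can raise false alarms, mirroring the false-positive problem FaultHound targets.

```cpp
// Illustrative bit-mask value-neighborhood filter (assumed encoding, not the
// paper's exact filter format).
#include <cstdint>
#include <cstdio>

struct BitMaskFilter {
    uint64_t value = 0;       // last observed value
    uint64_t mask  = ~0ULL;   // bits currently treated as "stable"
    bool trained   = false;

    // Returns true if v looks anomalous (falls outside the learned neighborhood).
    bool observe(uint64_t v) {
        if (!trained) { value = v; trained = true; return false; }
        bool anomalous = ((v ^ value) & mask) != 0;
        mask &= ~(v ^ value);   // bits that changed are no longer considered stable
        value = v;
        return anomalous;       // early flags here are the false positives the paper tackles
    }
};

int main() {
    BitMaskFilter f;
    uint64_t stream[] = {0x1000, 0x1004, 0x1008, 0x91007f00ULL, 0x100c};
    for (uint64_t v : stream)
        std::printf("0x%llx -> %s\n", (unsigned long long)v,
                    f.observe(v) ? "screen (possible fault)" : "ok");
    return 0;
}
```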
{"title":"FaultHound: Value-locality-based soft-fault tolerance","authors":"Nitin, I. Pomeranz, T. N. Vijaykumar","doi":"10.1145/2749469.2750372","DOIUrl":"https://doi.org/10.1145/2749469.2750372","url":null,"abstract":"Soft error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates, and performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce further the false-positive rate; (3) a light-weight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%).","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"1 1","pages":"668-681"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88649731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ishwar Bhati, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob
DRAM cells require periodic refreshing to preserve data. In JEDEC DDRx devices, a refresh operation is performed via an auto-refresh command, which refreshes multiple rows in multiple banks simultaneously. The internal implementation of auto-refresh is completely opaque outside the DRAM - all the memory controller can do is to instruct the DRAM to refresh itself - the DRAM handles all else, in particular determining which rows in which banks are to be refreshed. This is in conflict with a large body of research on reducing the refresh overhead, in which the memory controller needs fine-grained control over which regions of the memory are refreshed. For example, prior works exploit the fact that a subset of DRAM rows can be refreshed at a slower rate than other rows due to access rate or retention period variations. However, such row-granularity approaches cannot use the standard auto-refresh command, which refreshes an entire batch of rows at once and does not permit skipping of rows. Consequently, prior schemes are forced to use explicit sequences of activate (ACT) and precharge (PRE) operations to mimic row-level refreshing. The drawback is that, compared to using JEDEC's auto-refresh mechanism, using explicit ACT and PRE commands is inefficient, both in terms of performance and power. In this paper, we show that even when skipping a high percentage of refresh operations, existing row-granularity refresh techniques are mostly ineffective due to the inherent efficiency disparity between ACT/PRE and the JEDEC auto-refresh mechanism. We propose a modification to the DRAM that extends its existing control-register access protocol to include the DRAM's internal refresh counter. We also introduce a new “dummy refresh” command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to reduce as many refreshes as in prior work, while achieving significant energy and performance advantages by using auto-refresh most of the time.
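A minimal sketch of the controller-side decision, with names invented for illustration: when every row in the upcoming refresh batch can be skipped, the controller issues the proposed dummy refresh, which only advances the DRAM's internal refresh counter; otherwise it falls back to the standard auto-refresh command.

```cpp
// Toy refresh scheduler: choose between JEDEC auto-refresh and the proposed
// "dummy refresh" per batch, instead of mimicking row-level refresh with
// explicit ACT/PRE sequences.
#include <cstdio>
#include <vector>

enum class Cmd { AutoRefresh, DummyRefresh };

Cmd next_refresh_cmd(const std::vector<bool>& batch_needs_refresh, size_t batch) {
    return batch_needs_refresh[batch] ? Cmd::AutoRefresh : Cmd::DummyRefresh;
}

int main() {
    // One flag per refresh batch: true if any row in the batch has short retention
    // or has not been accessed recently enough to skip.
    std::vector<bool> needs = {true, false, false, true, false, false, false, true};
    for (size_t b = 0; b < needs.size(); ++b)
        std::printf("batch %zu -> %s\n", b,
                    next_refresh_cmd(needs, b) == Cmd::AutoRefresh
                        ? "AUTO-REFRESH" : "DUMMY-REFRESH (counter increment only)");
    return 0;
}
```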
{"title":"Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions","authors":"Ishwar Bhati, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob","doi":"10.1145/2749469.2750408","DOIUrl":"https://doi.org/10.1145/2749469.2750408","url":null,"abstract":"DRAM cells require periodic refreshing to preserve data. In JEDEC DDRx devices, a refresh operation is performed via an auto-refresh command, which refreshes multiple rows in multiple banks simultaneously. The internal implementation of auto-refresh is completely opaque outside the DRAM - all the memory controller can do is to instruct the DRAM to refresh itself - the DRAM handles all else, in particular determining which rows in which banks are to be refreshed. This is in conflict with a large body of research on reducing the refresh overhead, in which the memory controller needs fine-grained control over which regions of the memory are refreshed. For example, prior works exploit the fact that a subset of DRAM rows can be refreshed at a slower rate than other rows due to access rate or retention period variations. However, such row-granularity approaches cannot use the standard auto-refresh command, which refreshes an entire batch of rows at once and does not permit skipping of rows. Consequently, prior schemes are forced to use explicit sequences of activate (ACT) and precharge (PRE) operations to mimic row-level refreshing. The drawback is that, compared to using JEDEC's auto-refresh mechanism, using explicit ACT and PRE commands is inefficient, both in terms of performance and power. In this paper, we show that even when skipping a high percentage of refresh operations, existing row-granurality refresh techniques are mostly ineffective due to the inherent efficiency disparity between ACT/PRE and the JEDEC auto-refresh mechanism. We propose a modification to the DRAM that extends its existing control-register access protocol to include the DRAM's internal refresh counter. We also introduce a new “dummy refresh” command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to reduce as many refreshes as in prior work, while achieving significant energy and performance advantages by using auto-refresh most of the time.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"9 1","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85175731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. They occur naturally in programs because of workload properties; alternatively, when an in-core accelerator is employed, we get induced phases in which the code executing on the core is access code. We observe that such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4× and equivalent performance respectively, while consuming 0.8×, 0.6× and 0.4× the energy.
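The event-condition-action flavor of the model can be sketched in a few lines; the structure below is an assumption for illustration only, not the MAD hardware encoding. The "event" is a loaded index becoming available, the condition checks loop bounds, and the action computes the next address and issues the dependent access, which is exactly the gather-style work an access phase performs.

```cpp
// Illustrative event-condition-action rule driving an access phase.
#include <cstdio>
#include <functional>
#include <vector>

struct Rule {
    std::function<bool()> condition;   // rule fires only if this holds
    std::function<void()> action;      // address computation + dependent access
};

int main() {
    std::vector<int> index = {3, 0, 2, 1};
    std::vector<double> data = {10.0, 20.0, 30.0, 40.0};
    size_t i = 0;
    double sum = 0.0;

    Rule gather{[&] { return i < index.size(); },
                [&] { sum += data[index[i]]; ++i; }};   // data[index[i]] is the access code

    while (gather.condition()) gather.action();         // stand-in for the event loop
    std::printf("sum=%.1f\n", sum);
    return 0;
}
```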
{"title":"Efficient execution of memory access phases using dataflow specialization","authors":"C. Ho, Sung Jin Kim, K. Sankaralingam","doi":"10.1145/2749469.2750390","DOIUrl":"https://doi.org/10.1145/2749469.2750390","url":null,"abstract":"This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. These occur naturally in programs because of workload properties, or when employing an in-core accelerator, we get induced phases where the code execution on the core is access code. We observe such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show, relative to in-order, 2-wide OOO, and 4-wide OOO, MAD provides 2.4×, 1.4× and equivalent performance respectively. It provides 0.8×, 0.6× and 0.4× lower energy.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"16 1","pages":"118-130"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88352103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim's fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim's observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.
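A toy analogue of the observer-thread idea, not the released tool: one thread publishes a software signal while a second thread spins on spare hardware, timestamping and recording that signal as fast as it can. The real profiler additionally reads hardware performance counters on each iteration.

```cpp
// Minimal observer-thread sketch in the spirit of Shim's sampling loop.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

std::atomic<uint64_t> loop_count{0};   // "software signal" published by the app thread

int main() {
    std::atomic<bool> done{false};
    std::vector<std::pair<long long, uint64_t>> samples;
    samples.reserve(1 << 20);

    std::thread app([&] {
        for (uint64_t i = 0; i < 50000000ULL; ++i)
            loop_count.store(i, std::memory_order_relaxed);
        done.store(true);
    });

    std::thread observer([&] {
        auto t0 = std::chrono::steady_clock::now();
        while (!done.load(std::memory_order_relaxed)) {
            auto dt = std::chrono::steady_clock::now() - t0;   // fine-grain timestamp
            samples.emplace_back(dt.count(), loop_count.load(std::memory_order_relaxed));
        }
    });

    app.join();
    observer.join();
    std::printf("collected %zu samples\n", samples.size());
    return 0;
}
```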
{"title":"Computer performance microscopy with Shim","authors":"Xi Yang, S. Blackburn, K. McKinley","doi":"10.1145/2749469.2750401","DOIUrl":"https://doi.org/10.1145/2749469.2750401","url":null,"abstract":"Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim's fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim's observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"106 1","pages":"170-184"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88115909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, W. Ro, M. Annavaram
This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar; namely, the arithmetic differences between the register values of successive threads are small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result, the register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread level parallelism. To reduce register file data redundancy, Warped-Compression uses a low-cost, implementation-efficient base-delta-immediate (BDI) compression scheme that takes advantage of the banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-Compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.
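Base-delta-immediate compression itself is easy to sketch. The layout below (lane 0 as the base, one signed byte per delta) is chosen purely for illustration and is not necessarily the paper's exact format; the point is that a warp register whose per-lane values differ only slightly can be stored far more compactly, so fewer register banks need to be activated.

```cpp
// Check whether one warp register (32 per-lane values) fits a base + 1-byte-delta encoding.
#include <cstdint>
#include <cstdio>

bool bdi_compressible(const uint32_t lanes[32], uint32_t* base_out) {
    uint32_t base = lanes[0];
    for (int i = 0; i < 32; ++i) {
        int64_t delta = static_cast<int64_t>(lanes[i]) - static_cast<int64_t>(base);
        if (delta < INT8_MIN || delta > INT8_MAX) return false;  // delta too wide
    }
    *base_out = base;
    return true;   // store one 32-bit base + 32 one-byte deltas instead of 32 full words
}

int main() {
    uint32_t regs[32];
    for (int i = 0; i < 32; ++i) regs[i] = 0x10000000u + 4u * i;  // e.g. per-thread addresses

    uint32_t base;
    if (bdi_compressible(regs, &base))
        std::printf("compressible: base=0x%08x, 32 deltas of 1 byte each\n", base);
    else
        std::printf("stored uncompressed\n");
    return 0;
}
```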
{"title":"Warped-Compression: Enabling power efficient GPUs through register compression","authors":"Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, W. Ro, M. Annavaram","doi":"10.1145/2749469.2750417","DOIUrl":"https://doi.org/10.1145/2749469.2750417","url":null,"abstract":"This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar, namely the arithmetic differences between two successive thread registers is small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread level parallelism. To reduce register file data redundancy warped-compression uses low-cost and implementationefficient base-delta-immediate (BDI) compression scheme, that takes advantage of banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"23 1","pages":"502-514"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87615899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Jun, Ming Liu, Sungjin Lee, Jamey Hicks, J. Ankcorn, Myron King, Shuotao Xu, Arvind
Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data, and daily Twitter feeds, where the datasets of interest are 5 TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128 GB to 256 GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a RAM-cloud system falls sharply even if only 5%-10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.
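As a generic illustration of in-store processing (not BlueDBM's interface), the sketch below evaluates a selection predicate next to the storage so that only matching records cross the network to the host, rather than shipping whole flash pages into host DRAM.

```cpp
// Illustrative in-store filter: scan a page near the flash controller and
// return only the matching records to the host.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Record { uint64_t key; uint32_t value; };

// Conceptually runs "inside" the storage node.
std::vector<Record> in_store_filter(const std::vector<Record>& page, uint32_t threshold) {
    std::vector<Record> hits;
    for (const Record& r : page)
        if (r.value > threshold) hits.push_back(r);
    return hits;
}

int main() {
    std::vector<Record> page;
    for (uint64_t k = 0; k < 1000; ++k) page.push_back({k, static_cast<uint32_t>(k % 97)});

    auto hits = in_store_filter(page, 90);   // only a few percent of records leave the device
    std::printf("returned %zu of %zu records to the host\n", hits.size(), page.size());
    return 0;
}
```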
{"title":"BlueDBM: An appliance for Big Data analytics","authors":"S. Jun, Ming Liu, Sungjin Lee, Jamey Hicks, J. Ankcorn, Myron King, Shuotao Xu, Arvind","doi":"10.1145/2749469.2750412","DOIUrl":"https://doi.org/10.1145/2749469.2750412","url":null,"abstract":"Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds where the datasets of interest are 5TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GBs of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a ram-cloud system falls sharply even if only 5%~10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"17 1","pages":"1-13"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87494477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero
The increasing number of cores in manycore architectures causes significant power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide these programmability difficulties from the programmer is to make the compiler responsible for generating the code that manages the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.
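A rough software sketch of the access diversion, with helper names invented for illustration: a possibly-aliasing load first checks whether its target currently has a copy mapped into the scratchpad and, if so, reads that valid copy instead of the potentially stale data reachable through the cache hierarchy.

```cpp
// Illustrative guarded access for a hybrid cache/scratchpad system.
#include <cstddef>
#include <cstdio>

struct ScratchpadMapping {
    const double* dram_base;   // region whose working copy lives in the scratchpad
    double*       spm_copy;    // the valid copy
    size_t        count;
};

double guarded_load(const double* p, const ScratchpadMapping& m) {
    if (p >= m.dram_base && p < m.dram_base + m.count)
        return m.spm_copy[p - m.dram_base];   // divert to the valid scratchpad copy
    return *p;                                // not mapped: normal cached access
}

int main() {
    double dram[8] = {0};
    double spm[4]  = {1, 2, 3, 4};            // compiler copied dram[2..5] here
    ScratchpadMapping map{dram + 2, spm, 4};

    std::printf("%f %f\n", guarded_load(dram + 3, map), guarded_load(dram + 7, map));
    return 0;
}
```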
{"title":"Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures","authors":"Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero","doi":"10.1145/2872887.2750411","DOIUrl":"https://doi.org/10.1145/2872887.2750411","url":null,"abstract":"The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties to the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"125 1","pages":"720-732"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73307901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}