Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, A. Ayupov, John Greth, S. Burns, Özcan Özturk
Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications and show that the hardware accelerators generated by our template can outperform a 24-core high-end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x lower power consumption with significantly smaller area.
{"title":"Energy Efficient Architecture for Graph Analytics Accelerators","authors":"Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, A. Ayupov, John Greth, S. Burns, Özcan Özturk","doi":"10.1145/3007787.3001155","DOIUrl":"https://doi.org/10.1145/3007787.3001155","url":null,"abstract":"Specialized hardware accelerators can significantly improve the performance and power efficiency of compute systems. In this paper, we focus on hardware accelerators for graph analytics applications and propose a configurable architecture template that is specifically optimized for iterative vertex-centric graph applications with irregular access patterns and asymmetric convergence. The proposed architecture addresses the limitations of the existing multi-core CPU and GPU architectures for these types of applications. The SystemC-based template we provide can be customized easily for different vertex-centric applications by inserting application-level data structures and functions. After that, a cycle-accurate simulator and RTL can be generated to model the target hardware accelerators. In our experiments, we study several graph-parallel applications, and show that the hardware accelerators generated by our template can outperform a 24 core high end server CPU system by up to 3x in terms of performance. We also estimate the area requirement and power consumption of these hardware accelerators through physical-aware logic synthesis, and show up to 65x better power consumption with significantly smaller area.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"24 1","pages":"166-177"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75220346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiumin Xu, Hyeran Jeon, Keunsoo Kim, W. Ro, M. Annavaram
As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is growing hardware support for multiprogramming on GPUs. Hyper-Q, introduced in the Kepler architecture, enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels, but the partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that no single intra-SM slicing strategy delivers the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. It is also computationally efficient and can determine the resource partitioning quickly, enabling dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
{"title":"Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming","authors":"Qiumin Xu, Hyeran Jeon, Keunsoo Kim, W. Ro, M. Annavaram","doi":"10.1145/3007787.3001161","DOIUrl":"https://doi.org/10.1145/3007787.3001161","url":null,"abstract":"As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is a growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs) where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term as intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that there is not one intra-SM slicing strategy that derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"6 1","pages":"230-242"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87893209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Divya Mahajan, Amir Yazdanbaksh, Jongse Park, Bradley Thwaites, H. Esmaeilzadeh
Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator (improving performance and efficiency) or run on the precise core (maintaining quality). In this paper, we introduce MITHRA, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural-network-based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural-network-based design shows similar speedup but improves efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
{"title":"Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration","authors":"Divya Mahajan, Amir Yazdanbaksh, Jongse Park, Bradley Thwaites, H. Esmaeilzadeh","doi":"10.1145/3007787.3001144","DOIUrl":"https://doi.org/10.1145/3007787.3001144","url":null,"abstract":"Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator-improving performance and efficiency-or run on the precise core-maintaining quality. In this paper we introduce MITHRA, a co-designed hardware-software solution, that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. MITHRA seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes benefits from approximation while providing statistical guarantees that final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural network based. To understand the efficacy of these mechanisms, we compare them with an ideal, but infeasible design, the oracle. Results show that, with 95% confidence the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup however, improves the efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that MITHRA performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"75 1","pages":"66-77"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86063934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hoseok Seol, Wongyu Shin, Jaemin Jang, Jungwhan Choi, Jinwoong Suh, L. Kim
As DRAM data bandwidth increases, tremendous energy is dissipated in the DRAM data bus. To reduce the energy consumed in the data bus, DRAM interfaces with asymmetric termination, such as Pseudo Open Drain (POD) and Low Voltage Swing Terminated Logic (LVSTL), have been adopted in modern DRAMs. In interfaces using asymmetric termination, the amount of termination energy is proportional to the Hamming weight of the data words. In this work, we propose Bitwise Difference Encoding (BD-Encoding), which decreases the Hamming weight of data words, leading to a reduction in energy consumption in the modern DRAM data bus. Since a smaller Hamming weight also reduces switching activity, switching energy and power noise are both reduced as well. BD-Encoding exploits the similarity of data words in the DRAM data bus. We observed that similar data words (i.e., data words whose Hamming distance is small) are highly likely to be sent over the bus at similar times. Based on this observation, BD-coder stores recently transferred data in both the memory controller and the DRAMs. Then, BD-coder transfers the bitwise difference between the current data and the most similar stored data. In an evaluation using SPEC 2006, BD-Encoding with the 64 most recent data words reduced termination energy by 58.3% and switching energy by 45.3%. In addition, L·di/dt noise was reduced by 55%.
{"title":"Energy Efficient Data Encoding in DRAM Channels Exploiting Data Value Similarity","authors":"Hoseok Seol, Wongyu Shin, Jaemin Jang, Jungwhan Choi, Jinwoong Suh, L. Kim","doi":"10.1145/3007787.3001213","DOIUrl":"https://doi.org/10.1145/3007787.3001213","url":null,"abstract":"As DRAM data bandwidth increases, tremendous energy is dissipated in the DRAM data bus. To reduce the energy consumed in the data bus, DRAM interfaces with symmetric termination, such as Pseudo Open Drain (POD) and Low Voltage Swing Terminated Logic (LVSTL), have been adopted in modern DRAMs. In interfaces using asymmetric termination, the amount of termination energy is proportional to the hamming weight of the data words. In this work, we propose Bitwise Difference Encoding (BD-Encoding), which decreases the hamming weight of data words, leading to a reduction in energy consumption in the modern DRAM data bus. Since smaller hamming weight of the data words also reduces switching activity, switching energy and power noise are also both reduced. BD-Encoding exploits the similarity in data words in the DRAM data bus. We observed that similar data words (i.e. data words whose hamming distance is small) are highly likely to be sent over at similar times. Based on this observation, BD-coder stores the data recently sent over in both the memory controller and DRAMs. Then, BD-coder transfers the bitwise difference between the current data and the most similar data. In an evaluation using SPEC 2006, BD-Encoding using 64 recent data reduced termination energy by 58.3% and switching energy by 45.3%. In addition, 55% of the LdI/dt noise was decreased with BD-Encoding.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"719-730"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88769640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Laurenzano, Yunqi Zhang, Jiang Chen, Lingjia Tang, Jason Mars
On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small hardware units to facilitate phase identification and to trigger different power states, enabling the software layer to cheaply track, predict, and take advantage of varying unit criticality across application phases by power gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.
{"title":"PowerChop: Identifying and Managing Non-critical Units in Hybrid Processor Architectures","authors":"M. Laurenzano, Yunqi Zhang, Jiang Chen, Lingjia Tang, Jason Mars","doi":"10.1145/3007787.3001152","DOIUrl":"https://doi.org/10.1145/3007787.3001152","url":null,"abstract":"On-core microarchitectural structures consume significant portions of a processor's power budget. However, depending on application characteristics, those structures do not always provide (much) performance benefit. While timeout-based power gating techniques have been leveraged for underutilized cores and inactive functional units, these techniques have not directly translated to high-activity units such as vector processing units, complex branch predictors, and caches. The performance benefit provided by these units does not necessarily correspond with unit activity, but instead is a function of application characteristics. This work introduces PowerChop, a novel technique that leverages the unique capabilities of HW/SW co-designed hybrid processors to enact unit-level power management at the application phase level. PowerChop adds two small additional hardware units to facilitate phase identification and triggering different power states, enabling the software layer to cheaply track, predict and take advantage of varying unit criticality across application phases by powering gating units that are not needed for performant execution. Through detailed experimentation, we find that PowerChop significantly decreases power consumption, reducing the leakage power of a hybrid server processor by 9% on average (up to 33%) and a hybrid mobile processor by 19% (up to 40%) while introducing just 2% slowdown.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"59 1","pages":"140-152"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91128949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jungrae Kim, Michael B. Sullivan, Esha Choukse, M. Erez
As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep up with this growing demand is challenging, however, due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth. BPC targets homogeneously-typed memory blocks, which are prevalent in many-core architectures, and applies a smart data transformation to both improve the inherent data compressibility and reduce the complexity of the compression hardware. We demonstrate that BPC provides superior compression ratios of 4.1:1 for integer benchmarks and significantly reduces memory bandwidth requirements.
{"title":"Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures","authors":"Jungrae Kim, Michael B. Sullivan, Esha Choukse, M. Erez","doi":"10.1145/3007787.3001172","DOIUrl":"https://doi.org/10.1145/3007787.3001172","url":null,"abstract":"As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep up with the demand growth is challenging, however, due to strict area and energy limitations. This paper presents a novel and lightweight compression algorithm, Bit-Plane Compression (BPC), to increase the effective memory bandwidth. BPC aims at homogeneously-typed memory blocks, which are prevalent in many-core architectures, and applies a smart data transformation to both improve the inherent data compressibility and to reduce the complexity of compression hardware. We demonstrate that BPC provides superior compression ratios of 4.1:1 for integer benchmarks and reduces memory bandwidth requirements significantly.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"31 1","pages":"329-340"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74666696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, O. Mutlu, S. Keckler
Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high-bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer, such that any application can transparently benefit from near-data processing capabilities in the logic layer. Our paper develops two new mechanisms to address this key challenge. The first is a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. The second is a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
{"title":"Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems","authors":"Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, O. Mutlu, S. Keckler","doi":"10.1145/3007787.3001159","DOIUrl":"https://doi.org/10.1145/3007787.3001159","url":null,"abstract":"Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer such that any application can transparently benefit from near-data processing capabilities in the logic layer. Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"204-216"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74704492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Milad Hashemi, Khubaib, Eiman Ebrahimi, O. Mutlu, Y. Patt
On-chip contention increases memory access latency for multi-core processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as the source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if they were issued by the core. On a set of memory-intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest-performing prefetcher in our evaluation.
{"title":"Accelerating Dependent Cache Misses with an Enhanced Memory Controller","authors":"Milad Hashemi, Khubaib, Eiman Ebrahimi, O. Mutlu, Y. Patt","doi":"10.1145/3007787.3001184","DOIUrl":"https://doi.org/10.1145/3007787.3001184","url":null,"abstract":"On-chip contention increases memory access latency for multi-core processors. We identify that this additional latency has a substantial effect on performance for an important class of latency-critical memory operations: those that result in a cache miss and are dependent on data from a prior cache miss. We observe that the number of instructions between the first cache miss and its dependent cache miss is usually small. To minimize dependent cache miss latency, we propose adding just enough functionality to dynamically identify these instructions at the core and migrate them to the memory controller for execution as soon as source data arrives from DRAM. This migration allows memory requests issued by our new Enhanced Memory Controller (EMC) to experience a 20% lower latency than if issued by the core. On a set of memory intensive quad-core workloads, the EMC results in a 13% improvement in system performance and a 5% reduction in energy consumption over a system with a Global History Buffer prefetcher, the highest performing prefetcher in our evaluation.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"444-455"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84285617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hsiang-Yun Cheng, Jishen Zhao, J. Sampson, M. J. Irwin, A. Jaleel, Yu Lu, Yuan Xie
Emerging non-volatile memory (NVM) technologies, such as spin-transfer torque RAM (STT-RAM), are attractive options for replacing or augmenting SRAM in implementing last-level caches (LLCs). However, the asymmetric read/write energy and latency associated with NVM introduce new challenges in designing caches where, in contrast to SRAM, dynamic energy from write operations can be responsible for a larger fraction of total cache energy than leakage. As a result, no single traditional inclusion policy is dominant in terms of LLC energy consumption for asymmetric LLCs. We propose a novel selective inclusion policy, Loop-block-Aware Policy (LAP), to reduce energy consumption in LLCs with asymmetric read/write properties. In order to eliminate redundant writes to the LLC, LAP incorporates advantages from both non-inclusive and exclusive designs to selectively cache only part of the upper-level data in the LLC. Results show that LAP outperforms other variants of selective inclusion policies and consumes 20% and 12% less energy than non-inclusive and exclusive STT-RAM-based LLCs, respectively. We extend LAP to a system with SRAM/STT-RAM hybrid LLCs to achieve energy-efficient data placement, reducing energy consumption by 22% and 15% on average over non-inclusion and exclusion, respectively, with average-case performance improvements, small worst-case performance loss, and minimal hardware overheads.
{"title":"LAP: Loop-Block Aware Inclusion Properties for Energy-Efficient Asymmetric Last Level Caches","authors":"Hsiang-Yun Cheng, Jishen Zhao, J. Sampson, M. J. Irwin, A. Jaleel, Yu Lu, Yuan Xie","doi":"10.1145/3007787.3001148","DOIUrl":"https://doi.org/10.1145/3007787.3001148","url":null,"abstract":"Emerging non-volatile memory (NVM) technologies, such as spin-transfer torque RAM (STT-RAM), are attractive options for replacing or augmenting SRAM in implementing last-level caches (LLCs). However, the asymmetric read/write energy and latency associated with NVM introduces new challenges in designing caches where, in contrast to SRAM, dynamic energy from write operations can be responsible for a larger fraction of total cache energy than leakage. These properties lead to the fact that no single traditional inclusion policy being dominant in terms of LLC energy consumption for asymmetric LLCs. We propose a novel selective inclusion policy, Loop-block-Aware Policy (LAP), to reduce energy consumption in LLCs with asymmetric read/write properties. In order to eliminate redundant writes to the LLC, LAP incorporates advantages from both non-inclusive and exclusive designs to selectively cache only part of upper-level data in the LLC. Results show that LAP outperforms other variants of selective inclusion policies and consumes 20% and 12% less energy than non-inclusive and exclusive STT-RAM-based LLCs, respectively. We extend LAP to a system with SRAM/STT-RAM hybrid LLCs to achieve energy-efficient data placement, reducing the energy consumption by 22% and 15% over non-inclusion and exclusion on average, with average-case performance improvements, small worst-case performance loss, and minimal hardware overheads.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"103-114"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85267977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lunkai Zhang, Brian Neely, D. Franklin, D. Strukov, Yuan Xie, F. Chong
Emerging resistive memory technologies, such as PCRAM and ReRAM, have been proposed as promising replacements for DRAM-based main memory, due to their better scalability, low standby power, and non-volatility. However, limited write endurance is a major drawback for such resistive memory technologies. Wear leveling (balancing the distribution of writes) and wear limiting (reducing the number of writes) have been proposed to mitigate this disadvantage, but both techniques only manage a fixed budget of writes to a memory system rather than increase the number available. In this paper, we propose a new type of wear limiting technique, Mellow Writes, which reduces the wearout of individual writes rather than reducing the number of writes. Mellow Writes is based on the fact that slow writes performed with lower dissipated power can lead to longer endurance (and therefore longer lifetimes). For non-volatile memories, an N¹ to N³ times endurance improvement can be achieved if the write operation is slowed down by a factor of N. We present three microarchitectural mechanisms (Bank-Aware Mellow Writes, Eager Mellow Writes, and Wear Quota) that selectively perform slow writes to increase memory lifetime while minimizing performance impact. Assuming a factor of N² advantage in cell endurance for a factor of N slower write, our best Mellow Writes mechanism can achieve 2.58× the lifetime and 1.06× the performance of the baseline system. In addition, its performance is almost the same as that of a system aggressively optimized for performance (at the expense of endurance). Finally, Wear Quota guarantees a minimal lifetime (e.g., 8 years) by forcing more slow writes in the presence of heavy workloads. We also perform a sensitivity analysis on the endurance advantage factor for slow writes, from N¹ to N³, and find that our technique is still useful for factors as low as N¹.
{"title":"Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs","authors":"Lunkai Zhang, Brian Neely, D. Franklin, D. Strukov, Yuan Xie, F. Chong","doi":"10.1145/3007787.3001192","DOIUrl":"https://doi.org/10.1145/3007787.3001192","url":null,"abstract":"Emerging resistive memory technologies, such as PCRAM and ReRAM, have been proposed as promising replacements for DRAM-based main memory, due to their better scalability, low standby power, and non-volatility. However, limited write endurance is a major drawback for such resistive memory technologies. Wear leveling (balancing the distribution of writes) and wear limiting (reducing the number of writes) have been proposed to mitigate this disadvantage, but both techniques only manage a fixed budget of writes to a memory system rather than increase the number available. In this paper, we propose a new type of wear limiting technique, Mellow Writes, which reduces the wearout of individual writes rather than reducing the number of writes. Mellow Writes is based on the fact that slow writes performed with lower dissipated power can lead to longer endurance (and therefore longer lifetimes). For non-volatile memories, an N1 to N3 times endurance can be achieved if the write operation is slowed down by N times. We present three microarchitectural mechanisms (BankAware Mellow Writes, Eager Mellow Writes, and Wear Quota) that selectively perform slow writes to increase memory lifetime while minimizing performance impact. Assuming a factor N2 advantage in cell endurance for a factor N slower write, our best Mellow Writes mechanism can achieve 2.58× lifetime and 1.06× performance of the baseline system. In addition, its performance is almost the same as a system aggressively optimized for performance (at the expense of endurance). Finally, Wear Quota guarantees a minimal lifetime (e.g., 8 years) by forcing more slow writes in presence of heavy workloads. We also perform sensitivity analysis on the endurance advantage factor for slow writes, from N1 to N3, and find that our technique is still useful for factors as low as N1.","PeriodicalId":6634,"journal":{"name":"2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)","volume":"202 1","pages":"519-531"},"PeriodicalIF":0.0,"publicationDate":"2016-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89639609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}