
Latest publications: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)

Improving cache performance using read-write partitioning
S. Khan, Alaa R. Alameldeen, C. Wilkerson, O. Mutlu, Daniel A. Jiménez
Cache read misses stall the processor if there are no independent instructions to execute. In contrast, most cache write misses are off the critical path of execution, since writes can be buffered in the cache or the store buffer. With few exceptions, cache lines that serve loads are more critical for performance than cache lines that serve only stores. Unfortunately, traditional cache management mechanisms do not take into account this disparity between read-write criticality. This paper proposes a Read-Write Partitioning (RWP) policy that minimizes read misses by dynamically partitioning the cache into clean and dirty partitions, where partitions grow in size if they are more likely to receive future read requests. We show that exploiting the differences in read-write criticality provides better performance over prior cache management mechanisms. For a single-core system, RWP provides 5% average speedup across the entire SPEC CPU2006 suite, and 14% average speedup for cache-sensitive benchmarks, over the baseline LRU replacement policy. We also show that RWP can perform within 3% of a new yet complex instruction-address-based technique, Read Reference Predictor (RRP), that bypasses cache lines which are unlikely to receive any read requests, while requiring only 5.4% of RRP's state overhead. On a 4-core system, our RWP mechanism improves system throughput by 6% over the baseline and outperforms three other state-of-the-art mechanisms we evaluate.
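To make the partitioning idea concrete, here is a minimal sketch of one cache set split into clean and dirty partitions with a per-set quota, evicting from whichever partition exceeds its target size. The class and parameter names are illustrative assumptions, and the dynamic quota adaptation that RWP performs (growing the partition more likely to serve future reads) is omitted; this is not the paper's exact mechanism.

```python
from collections import OrderedDict

class RWPSet:
    """One set of a read-write-partitioned cache (illustrative sketch)."""
    def __init__(self, ways=16, clean_quota=8):
        self.ways = ways
        self.clean_quota = clean_quota      # target number of clean ways; RWP adapts this at run time
        self.lines = OrderedDict()          # tag -> dirty flag, kept in LRU order

    def _evict(self):
        clean = [t for t, d in self.lines.items() if not d]
        dirty = [t for t, d in self.lines.items() if d]
        # Evict from whichever partition is over its target size.
        if len(clean) > self.clean_quota:
            victim = clean[0]               # LRU clean line
        elif dirty:
            victim = dirty[0]               # LRU dirty line
        else:
            victim = next(iter(self.lines))
        self.lines.pop(victim)

    def access(self, tag, is_write):
        hit = tag in self.lines
        if hit:
            dirty = self.lines.pop(tag) or is_write
        else:
            if len(self.lines) >= self.ways:
                self._evict()
            dirty = is_write
        self.lines[tag] = dirty             # re-insert at MRU position
        return hit
```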
Citations: 68
Exploiting thermal energy storage to reduce data center capital and operating expenses
Wenli Zheng, Kai Ma, Xiaorui Wang
Power shaving has recently been proposed to dynamically shave the power peaks of a data center with energy storage devices (ESD), such that more servers can be safely hosted. In addition to the reduction of capital investment (cap-ex), power shaving also helps cut the electricity bills (op-ex) of a data center by reducing the high utility tariffs related to peak power. However, existing work on power shaving focuses exclusively on electrical ESDs (e.g., UPS batteries) to shave the server-side power demand. In this paper, we propose TE-Shave, a generalized power shaving framework that exploits both UPS batteries and a new knob, thermal energy storage (TES) tanks equipped in many data centers. Specifically, TE-Shave utilizes stored cold water or ice to manipulate the cooling power, which accounts for 30-40% of the total power cost of a data center. Our extensive evaluation with real-world workload traces shows that TE-Shave saves cap-ex and op-ex up to $2,668/day and $825/day, respectively, for a data center with 17,920 servers. Even for future data centers that are projected to have more efficient cooling and thus a smaller portion of cooling power, e.g., a quarter of today's level, TE-Shave still leads to 28% more savings than existing work that focuses only on the server-side power. TE-Shave is also coordinated with traditional TES solutions for further reduced op-ex.
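The sketch below illustrates the peak-shaving idea as a simple per-interval control step: when projected facility power exceeds the provisioned cap, draw from the thermal tank (which offsets cooling power directly) and then the UPS batteries; otherwise recharge with spare headroom. The greedy policy, thresholds, and units are assumptions for illustration, not TE-Shave's actual control algorithm.

```python
def shave_step(it_power_kw, cooling_power_kw, cap_kw,
               tes_kwh, battery_kwh, dt_h=0.25):
    """One control interval of a naive power-shaving policy (sketch)."""
    total = it_power_kw + cooling_power_kw
    if total > cap_kw:
        deficit_kwh = (total - cap_kw) * dt_h
        # Prefer the thermal tank: it directly offsets cooling power.
        from_tes = min(deficit_kwh, tes_kwh, cooling_power_kw * dt_h)
        from_batt = min(deficit_kwh - from_tes, battery_kwh)
        tes_kwh -= from_tes
        battery_kwh -= from_batt
        grid_kw = total - (from_tes + from_batt) / dt_h
    else:
        # Use spare headroom to recharge storage before the next peak.
        recharge_kwh = min((cap_kw - total) * dt_h, 50.0)  # arbitrary charge-rate limit
        tes_kwh += recharge_kwh
        grid_kw = total + recharge_kwh / dt_h
    return grid_kw, tes_kwh, battery_kwh
```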
Citations: 59
Up by their bootstraps: Online learning in Artificial Neural Networks for CMP uncore power management
Jae-Yeon Won, X. Chen, Paul V. Gratz, Jiang Hu, V. Soteriou
With increasing core counts in Chip Multi-Processor (CMP) designs, the size of the on-chip communication fabric and shared Last-Level Caches (LLC), which we term uncore here, is also growing, consuming as much as 30% of die area and a significant portion of chip power budget. In this work, we focus on improving the uncore energy-efficiency using dynamic voltage and frequency scaling. Previous approaches are mostly restricted to reactive techniques, which may respond poorly to abrupt workload and uncore utility changes. We find, however, there are predictable patterns in uncore utility which point towards the potential of a proactive approach to uncore power management. In this work, we utilize artificial intelligence principles to proactively leverage uncore utility pattern prediction via an Artificial Neural Network (ANN). ANNs, however, require training to produce accurate predictions. Architecting an efficient training mechanism without a priori knowledge of the workload is a major challenge. We propose a novel technique in which a simple Proportional Integral (PI) controller is used as a secondary classifier during ANN training, dynamically pulling the ANN up by its bootstraps to achieve accurate predictions. Both the ANN and the PI controller, then, work in tandem once the ANN training phase is complete. The advantage of using a PI controller to initially train the ANN is a dramatic acceleration of the ANN's initial learning phase. Thus, in a real system, this scenario allows quick power-control adaptation to rapid application phase changes and context switches during execution. We show that the proposed technique produces results comparable to those of pure offline training without a need for prerecorded training sets. Full system simulations using the PARSEC benchmark suite show that the bootstrapped ANN improves the energy-delay product of the uncore system by 27% versus existing state-of-the-art methodologies.
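A toy rendering of the bootstrapping idea follows: during a warm-up phase a PI controller picks the uncore DVFS level reactively, and its decisions serve as training targets for an online predictor (a single linear neuron here, standing in for the paper's ANN), which then takes over and predicts proactively. All gains, learning rates, and the feature vector are assumptions for illustration only.

```python
class PIController:
    def __init__(self, kp=0.5, ki=0.1, target_util=0.7):
        self.kp, self.ki, self.target = kp, ki, target_util
        self.integral = 0.0

    def level(self, util):
        err = util - self.target
        self.integral += err
        return max(0.0, min(1.0, 0.5 + self.kp * err + self.ki * self.integral))

class OnlinePredictor:
    """Single linear neuron trained by stochastic gradient descent (ANN stand-in)."""
    def __init__(self, n_features=4, lr=0.05):
        self.w = [0.0] * n_features
        self.b = 0.5
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def train(self, x, target):
        err = self.predict(x) - target
        self.b -= self.lr * err
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]

def choose_level(history, util, pi, predictor, step, warmup=1000):
    features = ([0.0] * 4 + history)[-4:]    # last four utilization samples
    if step < warmup:
        target = pi.level(util)              # reactive decision
        predictor.train(features, target)    # bootstrap the predictor on it
        return target
    return max(0.0, min(1.0, predictor.predict(features)))
```

The point of the warm-up phase is exactly what the abstract describes: the PI controller supplies labels without any prerecorded training set, so the predictor's initial learning is accelerated before it drives decisions proactively.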
Citations: 49
GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management
Youngsok Kim, Jaewon Lee, Jae-Eon Jo, Jangwoo Kim
GPU programmers suffer from programmer-managed GPU memory because both performance and programmability heavily depend on GPU memory allocation and CPU-GPU data transfer mechanisms. To improve performance and programmability, programmers should be able to place only the data frequently accessed by GPU on GPU memory while overlapping CPU-GPU data transfers and GPU executions as much as possible. However, current GPU architectures and programming models blindly place entire data on GPU memory, requiring a significantly large GPU memory size. Otherwise, they must trigger unnecessary CPU-GPU data transfers due to an insufficient GPU memory size. In this paper, we propose GPUdmm, a novel GPU architecture to enable high-performance and memory-oblivious GPU programming. First, GPUdmm uses GPU memory as a cache of CPU memory to provide programmers a view of the CPU memory-sized programming space. Second, GPUdmm achieves high performance by exploiting data locality and dynamically transferring data between CPU and GPU memories while effectively overlapping CPU-GPU data transfers and GPU executions. Third, GPUdmm can further reduce unnecessary CPU-GPU data transfers by exploiting simple programmer hints. Our carefully designed and validated experiments (e.g., PCIe/DMA timing) against representative benchmarks show that GPUdmm can achieve up to five times higher performance for the same GPU memory size, or reduce the GPU memory size requirement by up to 75% while maintaining the same performance.
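As a rough software analogue of the "GPU memory as a cache of CPU memory" view, the sketch below tracks residency at page granularity: a miss evicts an LRU page (writing it back only if dirty) and copies the requested page to the device. The page size, LRU policy, and transfer stubs are assumptions; the real GPUdmm design works in hardware and overlaps CPU-GPU transfers with execution, which this sketch does not model.

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class DevicePageCache:
    """Device memory modeled as a page-granularity cache of host memory (sketch)."""
    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.resident = OrderedDict()            # page number -> dirty flag, LRU order

    def access(self, addr, is_write):
        page = addr // PAGE_SIZE
        if page in self.resident:
            dirty = self.resident.pop(page) or is_write
        else:
            if len(self.resident) >= self.capacity:
                victim, victim_dirty = self.resident.popitem(last=False)
                if victim_dirty:
                    self._copy_to_host(victim)   # write back only if needed
            self._copy_to_device(page)
            dirty = is_write
        self.resident[page] = dirty              # move to MRU position

    def _copy_to_device(self, page):
        pass  # placeholder for a host-to-device transfer of one page

    def _copy_to_host(self, page):
        pass  # placeholder for a device-to-host write-back
```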
Citations: 24
QuickRelease: A throughput-oriented approach to release consistency on GPUs
Blake A. Hechtman, Shuai Che, Derek Hower, Yingying Tian, Bradford M. Beckmann, M. Hill, S. Reinhardt, D. Wood
Graphics processing units (GPUs) have specialized throughput-oriented memory systems optimized for streaming writes with scratchpad memories to capture locality explicitly. Expanding the utility of GPUs beyond graphics encourages designs that simplify programming (e.g., using caches instead of scratchpads) and better support irregular applications with finer-grain synchronization. Our hypothesis is that, like CPUs, GPUs will benefit from caches and coherence, but that CPU-style “read for ownership” (RFO) coherence is inappropriate to maintain support for regular streaming workloads. This paper proposes QuickRelease (QR), which improves on conventional GPU memory systems in two ways. First, QR uses a FIFO to enforce the partial order of writes so that synchronization operations can complete without frequent cache flushes. Thus, non-synchronizing threads in QR can re-use cached data even when other threads are performing synchronization. Second, QR partitions the resources required by reads and writes to reduce the penalty of writes on read performance. Simulation results across a wide variety of general-purpose GPU workloads show that QR achieves a 7% average performance improvement compared to a conventional GPU memory system. Furthermore, for emerging workloads with finer-grain synchronization, QR achieves up to 42% performance improvement compared to a conventional GPU memory system without the scalability challenges of RFO coherence. To this end, QR provides a throughput-oriented solution to provide fine-grain synchronization on GPUs.
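A simplified model of the write-FIFO idea is sketched below: stores are appended to a FIFO, and a release synchronization inserts a marker and drains only the writes older than that marker, rather than flushing the entire cache. The structure names and the synchronous drain loop are illustrative assumptions, not the hardware design.

```python
from collections import deque

class WriteFIFO:
    """Enforces a partial order of writes for release synchronization (sketch)."""
    def __init__(self):
        self.fifo = deque()

    def store(self, addr, value):
        self.fifo.append(("write", addr, value))

    def release(self):
        self.fifo.append(("marker", None, None))
        # Drain entries up to and including the marker; writes issued after
        # the release are unaffected, and non-synchronizing threads keep
        # re-using their cached data in the meantime.
        while True:
            kind, addr, value = self.fifo.popleft()
            if kind == "marker":
                break
            self._write_through(addr, value)

    def _write_through(self, addr, value):
        pass  # placeholder for pushing the write toward shared memory
```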
Citations: 63
Over-clocked SSD: Safely running beyond flash memory chip I/O clock specs
Kai Zhao, K. S. Venkataraman, Xuebin Zhang, Jiangpeng Li, Ning Zheng, Tong Zhang
This paper presents a design strategy that enables aggressive use of flash memory chip I/O link over-clocking in solid-state drives (SSDs) without sacrificing storage reliability. The gradual wear-out and process variation of NAND flash memory makes the worst-case oriented error correction code (ECC) in SSDs largely under-utilized most of the time. This work proposes to opportunistically leverage under-utilized error correction strength to allow error-prone flash memory I/O link over-clocking. Its rationale and key design issues are presented and studied in this paper, and its potential effectiveness has been verified through hardware experiments and system simulations. Using sub-22nm NAND flash memory chips with I/O specs of 166MBps, we carried out extensive experiments and show that the proposed design strategy can enable SSDs safely operate with error-prone I/O link running at 275MBps. Trace-driven SSD simulations over a variety of workload traces show the system read response time can be reduced by over 20%.
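The decision rule below illustrates the opportunistic idea in the abstract: when recent reads show the ECC correcting far fewer errors than its limit, the remaining margin can absorb the extra link errors of a faster I/O clock, so the link runs over-clocked; otherwise it falls back to spec. The error-rate estimate, guard band, and numbers are assumptions, not the paper's calibrated policy.

```python
SPEC_MBPS = 166   # chip I/O spec rate from the abstract
TURBO_MBPS = 275  # over-clocked rate demonstrated in the paper

def pick_io_clock(avg_bit_errors_per_codeword, ecc_correctable,
                  extra_link_errors_turbo=2, guard_band=4):
    """Choose the flash I/O link rate based on spare ECC strength (sketch)."""
    headroom = ecc_correctable - avg_bit_errors_per_codeword
    if headroom >= extra_link_errors_turbo + guard_band:
        return TURBO_MBPS   # enough margin: over-clock the I/O link
    return SPEC_MBPS        # worn-out or noisy block: stay at spec

# Example: a 40-bit-correcting code seeing ~12 raw errors per codeword
# still leaves ample margin, so the link can run at 275 MBps.
print(pick_io_clock(avg_bit_errors_per_codeword=12, ecc_correctable=40))
```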
Citations: 9
MP3: Minimizing performance penalty for power-gating of Clos network-on-chip
Lizhong Chen, Lihang Zhao, Ruisheng Wang, T. Pinkston
Power-gating is a promising technique to mitigate the increasing static power of on-chip routers. Clos networks are potentially good targets for power-gating because of their path diversity and decoupling between processing elements and most of the routers. While power-gated Clos networks can perform better than power-gated direct networks such as meshes, a significant performance penalty exists when conventional power-gating techniques are used. In this paper, we propose an effective power-gating scheme, called MP3 (Minimal Performance Penalty Power-gating), which is able to achieve minimal (i.e., near-zero) performance penalty and save more static energy than conventional power-gating applied to Clos networks. MP3 is able to completely remove the wakeup latency from the critical path, reduce long-term and transient contention, and actively steer network traffic to create increased power-gating opportunities. Full system evaluation using PARSEC benchmarks shows that the proposed approach can significantly reduce the performance penalty to less than 1% (as opposed to 38% with conventional power-gating) while saving more than 47% of router static energy, with only 2.5% additional area overhead.
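One of the ideas in the abstract, steering traffic to create power-gating opportunities, can be illustrated with a toy middle-router selection for a Clos network: prefer routers that are already powered on so sleeping ones stay gated longer. The function and parameter names are assumptions, and the real scheme's wake-up handling off the critical path and utilization thresholds are omitted.

```python
def pick_middle_router(candidates, powered_on, load):
    """Choose a middle-stage router, biased toward already-awake ones (sketch)."""
    awake = [r for r in candidates if r in powered_on]
    if awake:
        # Steer to the least-loaded router that is already awake,
        # leaving gated routers asleep.
        return min(awake, key=lambda r: load.get(r, 0))
    # All candidate routers are gated: wake one. The real design hides the
    # resulting wake-up latency off the critical path; this sketch does not.
    victim = candidates[0]
    powered_on.add(victim)
    return victim
```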
Citations: 58
Strategies for anticipating risk in heterogeneous system design
Marisabel Guevara, Benjamin Lubin, Benjamin C. Lee
Heterogeneous design presents an opportunity to improve energy efficiency but raises a challenge in resource management. Prior design methodologies aim for performance and efficiency, yet a deployed system may miss these targets due to run-time effects, which we denote as risk. We propose design strategies that explicitly aim to mitigate risk. We introduce new processor selection criteria, such as the coefficient of variation in performance, to produce heterogeneous configurations that balance performance risks and efficiency rewards. Out of the tens of strategies we consider, risk-aware approaches account for more than 70% of the strategies that produce systems with the best service quality. Applying these risk-mitigating strategies to heterogeneous datacenter design can produce a system that violates response time targets 50% less often.
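The coefficient of variation named in the abstract is simply the standard deviation of a core type's measured performance divided by its mean; a risk-aware selection can then trade mean performance against that variability. The scoring weight and sample numbers below are made up purely to illustrate the computation.

```python
from statistics import mean, stdev

def coefficient_of_variation(samples):
    return stdev(samples) / mean(samples)

def risk_aware_score(samples, risk_weight=1.0):
    # Higher mean performance is better; higher variability (risk) is penalized.
    return mean(samples) * (1.0 - risk_weight * coefficient_of_variation(samples))

big_core = [100, 96, 58, 102, 99]     # fast on average, occasionally disturbed at run time
small_core = [60, 59, 61, 60, 58]     # slower but highly predictable
print(risk_aware_score(big_core), risk_aware_score(small_core))
```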
Citations: 25
Dynamic management of TurboMode in modern multi-core chips
David Lo, C. Kozyrakis
Dynamic overclocking of CPUs, or TurboMode, is a feature recently introduced on all x86 multi-core chips. It leverages thermal and power headroom from idle execution resources to overclock active cores to increase performance. TurboMode can accelerate CPU-bound applications at the cost of additional power consumption. Nevertheless, naive use of TurboMode can significantly increase power consumption without increasing performance. Thus far, there is no strategy for managing TurboMode to optimize its use across all workloads and efficiency metrics. This paper analyzes the impact of TurboMode on a wide range of efficiency metrics (performance, power, cost, and combined metrics such as QPS/W and ED2) for representative server workloads on various hardware configurations. We determine that TurboMode is generally beneficial for performance (up to +24%), cost efficiency (QPS/$ up to +8%), energy-delay product (ED, up to +47%), and energy-delay-squared product (ED2, up to +68%). However, TurboMode is inefficient for workloads that exhibit interference for shared resources. We use this information to build and validate a model that predicts the optimal TurboMode setting for each efficiency metric. We then implement autoturbo, a background daemon that dynamically manages TurboMode in real time without any hardware changes. We demonstrate that autoturbo improves QPS/$, ED, and ED2 by 8%, 47%, and 68% respectively over not using TurboMode. At the same time, autoturbo virtually eliminates all the large drops in those same metrics (-12%, -25%, -25% for QPS/$, ED, and ED2) that occur when TurboMode is used naively (always on).
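The combined metrics in the abstract are the energy-delay product (ED = E x D) and energy-delay-squared product (ED2 = E x D^2). A daemon in the spirit of autoturbo enables TurboMode only when the predicted metric with turbo on beats the metric with turbo off; the decision function below shows the comparison, with made-up example numbers.

```python
def ed(energy_j, delay_s):
    return energy_j * delay_s

def ed2(energy_j, delay_s):
    return energy_j * delay_s ** 2

def should_enable_turbo(base, turbo, metric=ed2):
    """base and turbo are (energy_joules, delay_seconds) estimates for a workload."""
    return metric(*turbo) < metric(*base)

# A 15% speedup for 20% more energy still improves ED2 (8670 < 10000):
print(should_enable_turbo(base=(100.0, 10.0), turbo=(120.0, 8.5)))
```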
Citations: 55
Sandbox Prefetching: Safe run-time evaluation of aggressive prefetchers
Seth H. Pugsley, Zeshan A. Chishti, C. Wilkerson, Peng-fei Chuang, Robert L. Scott, A. Jaleel, Shih-Lien Lu, K. Chow, R. Balasubramonian
Memory latency is a major factor in limiting CPU performance, and prefetching is a well-known method for hiding memory latency. Overly aggressive prefetching can waste scarce resources such as memory bandwidth and cache capacity, limiting or even hurting performance. It is therefore important to employ prefetching mechanisms that use these resources prudently, while still prefetching required data in a timely manner. In this work, we propose a new mechanism to determine at run-time the appropriate prefetching mechanism for the currently executing program, called Sandbox Prefetching. Sandbox Prefetching evaluates simple, aggressive offset prefetchers at run-time by adding the prefetch address to a Bloom filter, rather than actually fetching the data into the cache. Subsequent cache accesses are tested against the contents of the Bloom filter to see if the aggressive prefetcher under evaluation could have accurately prefetched the data, while simultaneously testing for the existence of prefetchable streams. Real prefetches are performed when the accuracy of evaluated prefetchers exceeds a threshold. This method combines the ideas of global pattern confirmation and immediate prefetching action to achieve high performance. Sandbox Prefetching improves performance across the tested workloads by 47.6% compared to not using any prefetching, and by 18.7% compared to the Feedback Directed Prefetching technique. Performance is also improved by 1.4% compared to the Access Map Pattern Matching Prefetcher, while incurring considerably less logic and storage overheads.
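The sandbox evaluation can be sketched as follows: each candidate offset prefetcher records the addresses it would have prefetched in a Bloom filter, and every demand access is tested against that filter to score the prefetcher without ever polluting the cache; real prefetching is enabled only once the measured accuracy crosses a threshold. The filter sizing, hashing, and threshold below are illustrative assumptions, not the hardware parameters from the paper.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

class SandboxedOffsetPrefetcher:
    """Evaluates one candidate offset prefetcher without issuing real prefetches."""
    def __init__(self, offset):
        self.offset = offset
        self.filter = BloomFilter()
        self.hits = self.accesses = 0

    def on_demand_access(self, line_addr):
        self.accesses += 1
        if line_addr in self.filter:               # would a prior pretend-prefetch have covered this?
            self.hits += 1
        self.filter.add(line_addr + self.offset)   # pretend-prefetch: no cache fill, no bandwidth

    def accurate_enough(self, threshold=0.5):
        return self.accesses > 256 and self.hits / self.accesses >= threshold
```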
Citations: 106