
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA): Latest Publications

Improving GPGPU resource utilization through alternative thread block scheduling
Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeon-Gon Cho, Soojung Ryu
High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of CTAs (Cooperative Thread Arrays), or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of these threads can have a significant impact on overall performance. In this work, we explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling: (1) LCS (lazy CTA scheduling), which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling), where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by measuring only the number of instructions issued, while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. Building on LCS and the observation that the maximum number of CTAs does not necessarily maximize performance, we also propose mixed concurrent kernel execution, which enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.
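A minimal sketch of the LCS selection step described above, assuming a hypothetical profiling pass that records instructions issued for a few candidate CTA-per-core limits (the counts below are invented); it shows only how an "instructions issued" measurement could pick a CTA cap, not the paper's hardware mechanism.

# Pick the CTA-per-core limit that maximized issue throughput in a sample window.
def pick_cta_limit(issued_per_cta_count):
    """issued_per_cta_count: {cta_count: instructions issued in the window}"""
    # Prefer the smallest CTA count that reaches the best observed issue count,
    # since fewer resident CTAs can also reduce cache contention.
    best = max(issued_per_cta_count.values())
    return min(c for c, n in issued_per_cta_count.items() if n == best)

# Hypothetical profile: issue rate saturates, then degrades, beyond 4 CTAs/core.
profile = {2: 60_000, 4: 81_000, 6: 79_500, 8: 74_000}
print(pick_cta_limit(profile))  # -> 4: LCS would cap this core at 4 CTAs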
DOI: 10.1109/HPCA.2014.6835937
Citations: 155
MemZip: Exploring unconventional benefits from memory compression
Ali Shafiee, Meysam Taassori, R. Balasubramonian, A. Davis
Memory compression has been proposed and deployed in the past to grow the capacity of a memory system and reduce page fault rates. Compression also has secondary benefits: it can reduce energy and bandwidth demands. However, most prior mechanisms have been designed to focus on the capacity metric and few prior works have attempted to explicitly reduce energy or bandwidth. Further, mechanisms that focus on the capacity metric also require complex logic to locate the requested data in memory. In this paper, we design a highly simple compressed memory architecture that does not target the capacity metric. Instead, it focuses on complexity, energy, bandwidth, and reliability. It relies on rank subsetting and a careful placement of compressed data and metadata to achieve these benefits. Further, the space made available via compression is used to boost other metrics - the space can be used to implement stronger error correction codes or energy-efficient data encodings. The best performing MemZip configuration yields a 45% performance improvement and 57% memory energy reduction, compared to an uncompressed non-sub-ranked baseline. Another energy-optimized configuration yields a 29.8% performance improvement and a 79% memory energy reduction, relative to the same baseline.
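A rough, illustrative sketch of the bandwidth intuition behind MemZip's use of rank subsetting: a line compressed to fewer bytes needs fewer sub-rank bursts. The sub-rank count and burst size below are simplifying assumptions, and metadata handling is omitted; this is not the paper's exact organization.

LINE_BYTES = 64
SUBRANKS = 8                                 # assumption: 8 sub-ranks per rank
BURST_BYTES = LINE_BYTES // SUBRANKS         # 8 B delivered per sub-rank burst

def bursts_needed(compressed_bytes):
    # A compressed line is padded up to whole sub-rank bursts; an incompressible
    # line simply uses all SUBRANKS bursts, as in an uncompressed system.
    return min(SUBRANKS, -(-compressed_bytes // BURST_BYTES))  # ceiling division

for size in (64, 40, 17, 8):
    print(f"{size:2d} B compressed -> {bursts_needed(size)}/{SUBRANKS} bursts")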
DOI: 10.1109/HPCA.2014.6835972
Citations: 99
Practical data value speculation for future high-end processors
Arthur Perais, André Seznec
Dedicating more silicon area to single thread performance will necessarily be considered worthwhile in future - potentially heterogeneous - multicores. In particular, value prediction (VP) was proposed in the mid 90's to enhance the performance of high-end uniprocessors by breaking true data dependencies. In this paper, we reconsider the concept of value prediction in the contemporary context and show its potential as a direction to improve current single thread performance. First, building on top of research carried out during the previous decade on confidence estimation, we show that every value predictor is amenable to very high prediction accuracy using very simple hardware. This clears the path to an implementation of VP without a complex selective reissue mechanism to absorb mispredictions. Prediction is performed in the in-order pipeline front-end and validation is performed in the in-order pipeline back-end, while the out-of-order engine is only marginally modified. Second, when predicting back-to-back occurrences of the same instruction, previous context-based value predictors relying on local value history exhibit a complex critical loop that should ideally be implemented in a single cycle. To bypass this requirement, we introduce a new value predictor, VTAGE, harnessing the global branch history. VTAGE can seamlessly predict back-to-back occurrences, allowing predictions to span over several cycles. It achieves higher performance than previously proposed context-based predictors. Specifically, using SPEC'00 and SPEC'06 benchmarks, our simulations show that combining VTAGE and a stride-based predictor yields up to 65% speedup on a fairly aggressive pipeline without support for selective reissue.
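A greatly simplified, single-table sketch of a VTAGE-style value predictor: index with a hash of the PC and global branch history, predict only at high confidence, and train at commit. Real VTAGE uses several tagged tables with different history lengths; the hash, table size, and confidence policy here are assumptions for illustration only.

class TinyValuePredictor:
    def __init__(self, entries=1024, conf_max=7):
        self.table = [{"tag": None, "value": 0, "conf": 0} for _ in range(entries)]
        self.entries, self.conf_max = entries, conf_max

    def _index(self, pc, ghist):
        # Hypothetical hash of PC with the global branch history register.
        return (pc ^ (ghist * 0x9E3779B1)) % self.entries

    def predict(self, pc, ghist):
        e = self.table[self._index(pc, ghist)]
        if e["tag"] == pc and e["conf"] == self.conf_max:
            return e["value"]            # confident: speculate on this value
        return None                      # otherwise: do not speculate

    def train(self, pc, ghist, actual):
        e = self.table[self._index(pc, ghist)]
        if e["tag"] == pc and e["value"] == actual:
            e["conf"] = min(self.conf_max, e["conf"] + 1)   # correct: build confidence
        elif e["tag"] == pc:
            e["value"], e["conf"] = actual, 0               # wrong value: retrain
        else:
            e.update(tag=pc, value=actual, conf=0)          # new instruction: allocate

vp = TinyValuePredictor()
vp.train(pc=0x400080, ghist=0b1011, actual=7)   # would predict 7 once confident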
DOI: 10.1109/HPCA.2014.6835952
Citations: 68
Atomic SC for simple in-order processors
Dibakar Gope, Mikko H. Lipasti
Sequential consistency is arguably the most intuitive memory consistency model for shared-memory multi-threaded programming, yet it appears to be a poor fit for simple, in-order processors that are most attractive in the power-constrained many-core era. This paper proposes an intuitively appealing and straightforward framework for ensuring sequentially consistent execution. Prior schemes have enabled similar reordering, but in ways that are most naturally implemented in aggressive out-of-order processors that support speculative execution or that require pervasive and error-prone revisions to the already-complex coherence protocols. The proposed Atomic SC approach adds a light-weight scheme for enforcing mutual exclusion to maintain proper SC order for reordered references, works without any alteration to the underlying coherence protocol and consumes minimal silicon area and energy. On an in-order processor running multithreaded PARSEC workloads, Atomic SC delivers performance that is equal to or better than prior SC-compatible schemes, which require much greater energy and design complexity.
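As background to the consistency model the paper targets, a small enumeration of the classic store-buffering litmus test showing what sequential consistency forbids (the (0, 0) outcome). This illustrates only the property Atomic SC preserves; it is not a model of the paper's mutual-exclusion hardware, and the two-thread program is the standard textbook example.

from itertools import permutations

THREAD0 = [("st", "x", 1), ("ld", "y", "r1")]
THREAD1 = [("st", "y", 1), ("ld", "x", "r2")]

def sc_outcomes():
    outcomes = set()
    ops = [("T0", i) for i in range(2)] + [("T1", i) for i in range(2)]
    for order in permutations(ops):
        # SC: every execution is an interleaving that respects program order.
        if [i for t, i in order if t == "T0"] != [0, 1]: continue
        if [i for t, i in order if t == "T1"] != [0, 1]: continue
        mem, regs = {"x": 0, "y": 0}, {}
        for t, i in order:
            kind, addr, val = (THREAD0 if t == "T0" else THREAD1)[i]
            if kind == "st":
                mem[addr] = val
            else:
                regs[val] = mem[addr]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

print(sc_outcomes())   # the three SC-legal outcomes; (0, 0) never appears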
DOI: 10.1109/HPCA.2014.6835950
Citations: 16
Warp-level divergence in GPUs: Characterization, impact, and mitigation
Ping Xiang, Yi Yang, Huiyang Zhou
High throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs) and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated when it is dispatched to a streaming multiprocessor (SM) and released only when the TB finishes. In this paper, we highlight that such TB-level resource management can severely affect the TLP that may be achieved in the hardware. First, different warps in a TB may finish at different times, which we refer to as `warp-level divergence'. Due to TB-level resource management, the resources allocated to early finished warps are essentially wasted as they need to wait for the longest running warp in the same TB to finish. Second, TB-level management can lead to resource fragmentation. For example, the maximum number of threads to run on an SM in an NVIDIA GTX 480 GPU is 1536. For an application with a TB containing 1024 threads, only 1 TB can run on the SM even though it has sufficient resources for a few hundred more threads. To overcome these inefficiencies, we propose to allocate and release resources at the warp level. Warps are dispatched to an SM as long as it has sufficient resources for a warp rather than a whole TB. Furthermore, whenever a warp is completed, its resources are released and can accommodate a new warp. This way, we effectively increase the number of active warps without actually increasing the size of critical resources. We present lightweight architectural support for the proposed warp-level resource management. The experimental results show that our approach achieves up to 76.0% (16.0% on average) performance gains and up to 21.7% (6.7% on average) energy savings at minor hardware overhead.
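A toy calculation of the waste that warp-level release removes, for one thread block whose warps finish at different times; the register count and finish times are invented for illustration, and the proposed architecture itself is not modeled.

REGS_PER_WARP = 2048                     # assumption: registers held by each warp
finish_cycles = [300, 420, 450, 900]     # 4 warps in one TB; one straggler warp

def wasted_reg_cycles_tb_level(finish):
    # TB-level management: every warp's registers stay allocated until the last
    # warp of the same TB finishes.
    last = max(finish)
    return sum((last - f) * REGS_PER_WARP for f in finish)

def wasted_reg_cycles_warp_level(finish):
    # Warp-level management: a warp's registers are released as soon as it finishes.
    return 0

print(wasted_reg_cycles_tb_level(finish_cycles))   # 3,133,440 register-cycles held idle
print(wasted_reg_cycles_warp_level(finish_cycles)) # 0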
DOI: 10.1109/HPCA.2014.6835939
Citations: 66
Supporting x86-64 address translation for 100s of GPU lanes
Jason Power, M. Hill, D. Wood
Efficient memory sharing between CPU and GPU threads can greatly expand the effective set of GPGPU workloads. For increased programmability, this memory should be uniformly virtualized, necessitating compatible address translation support for GPU memory references. However, even a modest GPU might need 100s of translations per cycle (6 CUs * 64 lanes/CU) with memory access patterns designed for throughput more than locality. To drive GPU MMU design, we examine GPU memory reference behavior with the Rodinia benchmarks and a database sort to find: (1) the coalescer and scratchpad memory are effective TLB bandwidth filters (reducing the translation rate by 6.8x on average), (2) TLB misses occur in bursts (60 concurrently on average), and (3) postcoalescer TLBs have high miss rates (29% average). We show how a judicious combination of extant CPU MMU ideas satisfies GPU MMU demands for 4 KB pages with minimal overheads (an average of less than 2% over ideal address translation). This proof-of-concept design uses per-compute unit TLBs, a shared highly-threaded page table walker, and a shared page walk cache.
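A small sketch of why the coalescer is such an effective TLB bandwidth filter, as the abstract reports: contiguous per-lane accesses collapse onto very few 4 KB pages. The lane count and access patterns below are illustrative assumptions, not the paper's measured workloads.

PAGE = 4096
LANES = 64

def translations_needed(base_addr, stride_bytes):
    # Number of distinct pages touched by one warp-wide access after coalescing,
    # i.e. the number of post-coalescer TLB lookups required.
    addrs = [base_addr + lane * stride_bytes for lane in range(LANES)]
    return len({a // PAGE for a in addrs})

print(translations_needed(0x10000, 4))      # unit-stride 4 B loads: 1 page
print(translations_needed(0x10000, 4096))   # page-stride gather: 64 pages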
DOI: 10.1109/HPCA.2014.6835965
Citations: 148
Revolver: Processor architecture for power efficient loop execution
Mitchell Hayenga, Vignyan Reddy Kothinti Naresh, Mikko H. Lipasti
With the rise of mobile and cloud-based computing, modern processor design has become the task of achieving maximum power efficiency at specific performance targets. This trend, coupled with dwindling improvements in single-threaded performance, has led architects to predominately focus on energy efficiency. In this paper we note that for the majority of benchmarks, a substantial portion of execution time is spent executing simple loops. Capitalizing on the frequency of loops, we design an out-of-order processor architecture that achieves an aggressive level of performance while minimizing the energy consumed during the execution of loops. The Revolver architecture achieves energy efficiency during loop execution by enabling “in-place execution” of loops within the processor's out-of-order backend. Essentially, a few static instances of each loop instruction are dispatched to the out-of-order execution core by the processor frontend. The static instruction instances may each be executed multiple times in order to complete all necessary loop iterations. During loop execution the processor frontend, including instruction fetch, branch prediction, decode, allocation, and dispatch logic, can be completely clock gated. Additionally we propose a mechanism to preexecute future loop iteration load instructions, thereby realizing parallelism beyond the loop iterations currently executing within the processor core. Employing Revolver across three benchmark suites, we eliminate 20, 55, and 84% of all frontend instruction dispatches. Overall, we find Revolver maintains performance, while resulting in 5.3%-18.3% energy-delay benefit over loop buffers or micro-op cache techniques alone.
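A back-of-the-envelope illustration of the frontend savings the Revolver idea targets: a loop body dispatched once and re-executed in place avoids being fetched and dispatched on every iteration. The instruction counts below are invented, not the paper's benchmark results.

def frontend_dispatch_savings(loop_body_insts, iterations, other_insts):
    # Baseline: the loop body passes through the frontend once per iteration.
    baseline = loop_body_insts * iterations + other_insts
    # Revolver-style in-place execution: the body is dispatched a single time.
    revolver = loop_body_insts + other_insts
    return 1.0 - revolver / baseline

# e.g. a 12-instruction loop run 1000 times inside a region with 8000 other instructions
print(f"{frontend_dispatch_savings(12, 1000, 8000):.1%} fewer frontend dispatches")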
DOI: 10.1109/HPCA.2014.6835968
Citations: 18
Low-overhead and high coverage run-time race detection through selective meta-data management
Ruirui C. Huang, Erik Halberg, Andrew Ferraiuolo, G. Suh
This paper presents an efficient hardware architecture that enables run-time data race detection with high coverage and minimal performance overhead. Run-time race detectors often rely on the happens-before vector clock algorithm for accuracy, yet suffer from either non-negligible performance overhead or low detection coverage due to a large amount of meta-data. Based on the observation that most of data races happen between close-by accesses, we introduce an optimization to selectively store meta-data only for recently shared memory locations and decouple meta-data storage from regular data storage such as caches. Experiments show that the proposed scheme enables run-time race detection with a minimal impact on performance (4.8% overhead on average) with very high detection coverage (over 99%). Furthermore, this architecture only adds a small amount of on-chip resources for race detection: a 13-KB buffer per core and a 1-bit tag per data cache block.
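A minimal sketch of the happens-before vector-clock test that run-time race detectors of this kind build on: two accesses to the same location race if neither is ordered before the other and at least one is a write. The paper's contribution is keeping such per-location metadata only for recently shared addresses, which this sketch does not model.

def happens_before(vc_a, vc_b):
    # vc_a happens before vc_b if it is component-wise <= and not identical.
    return all(a <= b for a, b in zip(vc_a, vc_b)) and vc_a != vc_b

def is_race(access_a, access_b):
    (vc_a, is_write_a), (vc_b, is_write_b) = access_a, access_b
    unordered = not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)
    return unordered and (is_write_a or is_write_b)

# Thread 0 writes with clock (1, 0); thread 1 reads with clock (0, 1): a race.
print(is_race(((1, 0), True), ((0, 1), False)))   # True
# The same read after synchronization has absorbed thread 0's clock: no race.
print(is_race(((1, 0), True), ((1, 1), False)))   # False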
DOI: 10.1109/HPCA.2014.6835979
Citations: 7
Accelerating decoupled look-ahead via weak dependence removal: A metaheuristic approach
Raj Parihar, Michael C. Huang
Despite the proliferation of multi-core and multi-threaded architectures, exploiting implicit parallelism for a single semantic thread is still a crucial component in achieving high performance. Look-ahead is a tried-and-true strategy in uncovering implicit parallelism, but a conventional, monolithic out-of-order core quickly becomes resource-inefficient when looking beyond a small distance. A more decoupled approach with an independent, dedicated look-ahead thread on a separate thread context can be a more flexible and effective implementation, especially in a multi-core environment. While capable of generating significant performance gains, the look-ahead agent often becomes the new speed limit. Fortunately, the look-ahead thread has no hard correctness constraints and presents new opportunities for optimizations. One such opportunity is to exploit “weak” dependences. Intuitively, not all dependences are equal. Some links in a dependence chain are weak enough that removing them in the look-ahead thread does not materially affect the quality of look-ahead but improves the speed. While there are some common patterns of weak dependences, they cannot be generalized as heuristics in generating better code for the look-ahead thread. A primary reason is that removing a false weak dependence can be exceedingly costly. Nevertheless, a trial-and-error approach can reliably identify opportunities for improving the look-ahead thread and quantify the benefits. A framework based on a genetic algorithm can help search for the right set of changes to the look-ahead thread. In the set of applications where the speed of look-ahead has become the new limit, this method is found to improve the overall system performance by up to 1.48x, with a geometric mean of 1.14x, over the baseline decoupled look-ahead system, while reducing energy consumption by 11%.
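A skeleton of the genetic-algorithm search described above, assuming a genome is a bit vector marking which candidate "weak" dependences to drop from the look-ahead thread; the fitness function is a placeholder where the real framework would plug in the measured speedup of the trimmed skeleton, and the population sizes are arbitrary.

import random

N_CANDIDATES, POP, GENERATIONS = 32, 20, 10   # sizes chosen only for the demo

def fitness(genome):
    # Placeholder: in the real framework this would be the measured speedup of
    # the look-ahead thread with the marked instructions removed.
    return sum(genome)

def evolve():
    pop = [tuple(random.randint(0, 1) for _ in range(N_CANDIDATES)) for _ in range(POP)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:POP // 2]                        # keep the fitter half
        children = []
        while len(parents) + len(children) < POP:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_CANDIDATES)     # one-point crossover
            child = list(a[:cut] + b[cut:])
            child[random.randrange(N_CANDIDATES)] ^= 1  # point mutation
            children.append(tuple(child))
        pop = parents + children
    return max(pop, key=fitness)

print(evolve())   # best genome found: which candidate dependences to drop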
DOI: 10.1109/HPCA.2014.6835974
Citations: 7
A Non-Inclusive Memory Permissions architecture for protection against cross-layer attacks
J. Elwell, Ryan D. Riley, N. Abu-Ghazaleh, D. Ponomarev
Protecting modern computer systems and complex software stacks against the growing range of possible attacks is becoming increasingly difficult. The architecture of modern commodity systems allows attackers to subvert privileged system software often using a single exploit. Once the system is compromised, inclusive permissions used by current architectures and operating systems easily allow a compromised high-privileged software layer to perform arbitrary malicious activities, even on behalf of other software layers. This paper presents a hardware-supported page permission scheme for the physical pages that is based on the concept of non-inclusive sets of memory permissions for different layers of system software such as hypervisors, operating systems, and user-level applications. Instead of viewing privilege levels as an ordered hierarchy with each successive level being more privileged, we view them as distinct levels each with its own set of permissions. Such a permission mechanism, implemented as part of a processor architecture, provides a common framework for defending against a range of recent attacks. We demonstrate that such a protection can be achieved with negligible performance overhead, low hardware complexity and minimal changes to the commodity OS and hypervisor code.
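An illustrative sketch of the non-inclusive permission idea: each software layer holds its own permission set for a physical page, and no layer's rights are implied by another's. The layer names, pages, and table layout are assumptions for illustration, not the proposed hardware encoding.

# Per-page, per-layer permission sets; note that "hv" and "os" get nothing on
# the application-private page, unlike an inclusive ring-style hierarchy.
page_perms = {
    0x1000: {"app": {"read", "write"}, "os": set(),             "hv": set()},
    0x2000: {"app": set(),             "os": {"read", "write"}, "hv": set()},
    0x3000: {"app": set(),             "os": set(),             "hv": {"read", "write"}},
}

def check(layer, page, access):
    # The check consults only the requesting layer's own set, so a compromised
    # high-privilege layer gains nothing on pages it was never granted.
    return access in page_perms.get(page, {}).get(layer, set())

print(check("os", 0x1000, "read"))    # False: the OS cannot read the app-private page
print(check("app", 0x1000, "write"))  # True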
DOI: 10.1109/HPCA.2014.6835931
Citations: 8