Prashant J. Nair, D. Roberts, Moinuddin K. Qureshi
Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data striping reduces memory-level parallelism, causing significant slowdown and higher power consumption. This paper proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, thus enabling high performance and low power while efficiently protecting the stacked memory from large-granularity failures. Citadel consists of three components: TSV-Swap, which can tolerate both faulty data TSVs and faulty address TSVs; Tri-Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures; and Dynamic Dual-Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions at either row or bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, yet provides 700x higher reliability than ChipKill-like ECC codes.
{"title":"Citadel: Efficiently Protecting Stacked Memory from Large Granularity Failures","authors":"Prashant J. Nair, D. Roberts, Moinuddin K. Qureshi","doi":"10.1109/MICRO.2014.57","DOIUrl":"https://doi.org/10.1109/MICRO.2014.57","url":null,"abstract":"Stacked memory modules are likely to be tightly integrated with the processor. It is vital that these memory modules operate reliably, as memory failure can require the replacement of the entire socket. To make matters worse, stacked memory designs are susceptible to newer failure modes (for example, due to faulty through-silicon vias, or TSVs) that can cause large portions of memory, such as a bank, to become faulty. To avoid data loss from large-granularity failures, the memory system may use symbol-based codes that stripe the data for a cache line across several banks (or channels). Unfortunately, such data-striping reduces memory level parallelism causing significant slowdown and higher power consumption. This paper proposes Citadel, a robust memory architecture that allows the memory system to retain each cache line within one bank, thus allowing high performance, lower power and efficiently protects the stacked memory from large-granularity failures. Citadel consists of three components, TSV-Swap, which can tolerate both faulty data-TSVs and faulty address-TSVs, Tri Dimensional Parity (3DP), which can tolerate column failures, row failures, and bank failures, and Dynamic Dual Granularity Sparing (DDS), which can mitigate permanent faults by dynamically sparing faulty memory regions either at a row granularity or at a bank granularity. Our evaluations with real-world data for DRAM failures show that Citadel provides performance and power similar to maintaining the entire cache line in the same bank, and yet provides 700x higher reliability than Chip Kill-like ECC codes.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"20 1","pages":"51-62"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81708552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern high-performance processors require memory systems that can provide access to data at a rate that is well matched to the processor's computation rate. Common to such systems is the organization of memory into local high-speed memory banks that can be accessed in parallel. Associative lookup of values is made efficient through indexing instead of associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations. Address mapping is most easily performed by indexing the banks using a mod-2^N indexing scheme, but such schemes interact poorly with the memory access patterns of many computations, making resource conflicts a significant memory system bottleneck. Previous work has assumed that prime moduli are the best choices to alleviate conflicts and has concentrated on finding efficient implementations for them. In this paper, we introduce a new scheme called Arbitrary Modulus Indexing (AMI) that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs. We also demonstrate that, for a memory-intensive workload on a modern replay-style GPU architecture, prime moduli are not in general the best choices for memory bank and cache set mappings. Applying AMI to a set of memory-intensive benchmarks eliminates 98% of bank and set conflicts, resulting in an average speedup of 24% over an aggressive baseline system and a 64% average reduction in memory system replays at reasonable implementation cost.
{"title":"Arbitrary Modulus Indexing","authors":"Jeff Diamond, D. Fussell, S. Keckler","doi":"10.1109/MICRO.2014.13","DOIUrl":"https://doi.org/10.1109/MICRO.2014.13","url":null,"abstract":"Modern high performance processors require memory systems that can provide access to data at a rate that is well matched to the processor's computation rate. Common to such systems is the organization of memory into local high speed memory banks that can be accessed in parallel. Associative look up of values is made efficient through indexing instead of associative memories. These techniques lose effectiveness when data locations are not mapped uniformly to the banks or cache locations, leading to bottlenecks that arise from excess demand on a subset of locations. Address mapping is most easily performed by indexing the banks using a mod (2 N) indexing scheme, but such schemes interact poorly with the memory access patterns of many computations, making resource conflicts a significant memory system bottleneck. Previous work has assumed that prime moduli are the best choices to alleviate conflicts and has concentrated on finding efficient implementations for them. In this paper, we introduce a new scheme called Arbitrary Modulus Indexing (AMI) that can be implemented efficiently for all moduli, matching or improving the efficiency of the best existing schemes for primes while allowing great flexibility in choosing a modulus to optimize cost/performance trade-offs. We also demonstrate that, for a memory-intensive workload on a modern replay-style GPU architecture, prime moduli are not in general the best choices for memory bank and cache set mappings. Applying AMI to set of memory intensive benchmarks eliminates 98% of bank and set conflicts, resulting in an average speedup of 24% over an aggressive baseline system and a 64% average reduction in memory system replays at reasonable implementation cost.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"7 1","pages":"140-152"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79091568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Laurenzano, Yunqi Zhang, Lingjia Tang, Jason Mars
Rampant dynamism due to load fluctuations, co-runner changes, and varying levels of interference poses a threat to application quality of service (QoS) and has limited our ability to allow co-locations in modern warehouse-scale computers (WSCs). Instruction set features such as the non-temporal memory access hints found in modern ISAs (both ARM and x86) may be useful in mitigating these effects. However, despite the challenge of this dynamism and the availability of an instruction set mechanism that might help address the problem, a key capability missing in the system software stack of modern WSCs is the ability to dynamically transform (and re-transform) the executing application code to apply these instruction set features when necessary. In this work we introduce protean code, a novel approach for enacting arbitrary compiler transformations at runtime for native programs running on commodity hardware with negligible (<1%) overhead. The fundamental insight behind the underlying mechanism of protean code is that, instead of maintaining full control throughout the program's execution as with traditional dynamic optimizers, protean code allows the original binary to execute continuously and diverts control flow only at a set of virtualized points, allowing rapid and seamless rerouting to the new code variants. In addition, the protean code compiler embeds IR with high-level semantic information into the program, empowering the dynamic compiler to perform rich analysis and transformations online with little overhead. Using a fully functional protean code compiler and runtime built on LLVM, we design PC3D, Protean Code for Cache Contention in Datacenters. PC3D dynamically employs non-temporal access hints to achieve utilization improvements of up to 2.8x (1.5x on average) higher than state-of-the-art contention mitigation runtime techniques at a QoS target of 98%.
{"title":"Protean Code: Achieving Near-Free Online Code Transformations for Warehouse Scale Computers","authors":"M. Laurenzano, Yunqi Zhang, Lingjia Tang, Jason Mars","doi":"10.1109/MICRO.2014.21","DOIUrl":"https://doi.org/10.1109/MICRO.2014.21","url":null,"abstract":"Rampant dynamism due to load fluctuations, co runner changes, and varying levels of interference poses a threat to application quality of service (QoS) and has limited our ability to allow co-locations in modern warehouse scale computers (WSCs). Instruction set features such as the non-temporal memory access hints found in modern ISAs (both ARM and x86) may be useful in mitigating these effects. However, despite the challenge of this dynamism and the availability of an instruction set mechanism that might help address the problem, a key capability missing in the system software stack in modern WSCs is the ability to dynamically transform (and re-transform) the executing application code to apply these instruction set features when necessary. In this work we introduce protean code, a novel approach for enacting arbitrary compiler transformations at runtime for native programs running on commodity hardware with negligible (<;1%) overhead. The fundamental insight behind the underlying mechanism of protean code is that, instead of maintaining full control throughout the program's execution as with traditional dynamic optimizers, protean code allows the original binary to execute continuously and diverts control flow only at a set of virtualized points, allowing rapid and seamless rerouting to the new code variants. In addition, the protean code compiler embeds IR with high-level semantic information into the program, empowering the dynamic compiler to perform rich analysis and transformations online with little overhead. Using a fully functional protean code compiler and runtime built on LLVM, we design PC3D, Protean Code for Cache Contention in Datacenters. PC3D dynamically employs non-temporal access hints to achieve utilization improvements of up to 2.8x (1.5x on average) higher than state-of-the-art contention mitigation runtime techniques at a QoS target of 98%.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"21 1","pages":"558-570"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89390675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, Wen-mei W. Hwu
With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and the cache hierarchy design, which limits system performance and energy efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies, designed for multicore systems, can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by a bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement) compared to the baseline GPU architecture and optimal static warp throttling, respectively.
{"title":"Adaptive Cache Management for Energy-Efficient GPU Computing","authors":"Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, Wen-mei W. Hwu","doi":"10.1109/MICRO.2014.11","DOIUrl":"https://doi.org/10.1109/MICRO.2014.11","url":null,"abstract":"With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch of the throughput-oriented execution model and its cache hierarchy design, which limits system performance and energy-efficiency. The massive amount of memory requests generated by GPU scause cache contention and resource congestion. Existing CPUcache management policies that are designed for multicoresystems, can be suboptimal when directly applied to GPUcaches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by the bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturatingon-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the active number of warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache sensitive benchmarks. This results in a harmonic mean IPC improvement of 74% and 17% (maximum 661% and 44% IPCimprovement), compared to the baseline GPU architecture and optimal static warp throttling, respectively.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"52 1","pages":"343-355"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74778058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation lookaside buffer (TLB) misses, where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that significantly speed up virtualized address translation. Further, this paper proposes two novel techniques to address important limitations of original direct segments. First, self-ballooning reduces fragmentation in physical memory and addresses the architectural input/output (I/O) gap in x86-64. Second, an escape filter provides alternate translations for exceptional pages within a direct segment (e.g., physical pages with permanent hard faults). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. One mode -- VMM Direct -- reduces address translation overhead to near-native without guest application or OS changes (2% slower than native on average), while a more aggressive mode -- Dual Direct -- performs better than native on big-memory workloads, with near-zero translation overhead.
{"title":"Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks","authors":"Jayneel Gandhi, Arkaprava Basu, M. Hill, M. Swift","doi":"10.1109/MICRO.2014.37","DOIUrl":"https://doi.org/10.1109/MICRO.2014.37","url":null,"abstract":"Virtualization provides value for many workloads, but its cost rises for workloads with poor memory access locality. This overhead comes from translation look aside buffer (TLB) misses where the hardware performs a 2D page walk (up to 24 memory references on x86-64) rather than a native TLB miss (up to only 4 memory references). The first dimension translates guest virtual addresses to guest physical addresses, while the second translates guest physical addresses to host physical addresses. This paper proposes new hardware using direct segments with three new virtualized modes of operation that significantly speed-up virtualized address translation. Further, this paper proposes two novel techniques to address important limitations of original direct segments. First, self-ballooning reduces fragmentation in physical memory, and addresses the architectural input/output (I/O) gap in x86-64. Second, an escape filter provides alternate translations for exceptional pages within a direct segment (e.g., Physical pages with permanent hard faults). We emulate the proposed hardware and prototype the software in Linux with KVM on x86-64. One mode -- VMM Direct -- reduces address translation overhead to near-native without guest application or OS changes (2% slower than native on average), while a more aggressive mode -- Dual Direct -- on big-memory workloads performs better-than-native with near-zero translation overhead.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"8 Suppl 2 1","pages":"178-189"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73151162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Smartphones and tablets have recently become widespread and dominant in the computer market. Users require that these mobile devices provide a high-quality experience and ever higher performance. Hence, major developers adopt out-of-order superscalar processors as application processors. However, these processors consume much more energy than in-order superscalar processors, because a large amount of energy is consumed by the hardware for dynamic instruction scheduling. We propose a Front-end Execution Architecture (FXA). FXA has two execution units: an out-of-order execution unit (OXU) and an in-order execution unit (IXU). The OXU is the execution core of a common out-of-order superscalar processor. In contrast, the IXU comprises only functional units and a bypass network. The IXU is placed at the processor front end and executes instructions without scheduling. Fetched instructions are first fed to the IXU, and instructions that are already ready, or that become ready through the resolution of their dependencies by operand bypassing in the IXU, are executed in order. Instructions that are not ready pass through the IXU as NOPs; the pipeline is therefore not stalled, and instructions keep flowing. The not-ready instructions are then dispatched to the OXU and executed out of order. The IXU does not include dynamic scheduling logic, and its energy consumption is consequently small. Evaluation results show that FXA can execute over 50% of instructions in the IXU, thereby making it possible to shrink the energy-consuming OXU without incurring performance degradation. As a result, FXA achieves both high performance and low energy consumption. We evaluated FXA against conventional out-of-order and in-order superscalar processors modeled after the ARM big.LITTLE architecture. The results show that FXA achieves performance improvements of 67% at the maximum and 7.4% on the geometric mean of the SPEC CPU2006 integer benchmark suite relative to a conventional superscalar processor (big), while reducing energy consumption by 86% in the issue queue and 17% in the whole processor. The performance/energy ratio (the inverse of the energy-delay product) of FXA is 25% higher than that of a conventional superscalar processor (big) and 27% higher than that of a conventional in-order superscalar processor (LITTLE).
{"title":"A Front-End Execution Architecture for High Energy Efficiency","authors":"Ryota Shioya, M. Goshima, H. Ando","doi":"10.1109/MICRO.2014.35","DOIUrl":"https://doi.org/10.1109/MICRO.2014.35","url":null,"abstract":"Smart phones and tablets have recently become widespread and dominant in the computer market. Users require that these mobile devices provide a high-quality experience and an even higher performance. Hence, major developers adopt out-of-order superscalar processors as application processors. However, these processors consume much more energy than in-order superscalar processors, because a large amount of energy is consumed by the hardware for dynamic instruction scheduling. We propose a Front-end Execution Architecture (FXA). FXA has two execution units: an out-of-order execution unit (OXU) and an in-order execution unit (IXU). The OXU is the execution core of a common out-of-order superscalar processor. In contrast, the IXU comprises functional units and a bypass network only. The IXU is placed at the processor front end and executes instructions without scheduling. Fetched instructions are first fed to the IXU, and the instructions that are already ready or become ready to execute by the resolution of their dependencies through operand bypassing in the IXU are executed in-order. Not ready instructions go through the IXU as a NOP, thereby, its pipeline is not stalled, and instructions keep flowing. The not-ready instructions are then dispatched to the OXU, and are executed out-of-order. The IXU does not include dynamic scheduling logic, and its energy consumption is consequently small. Evaluation results show that FXA can execute over 50% of instructions using IXU, thereby making it possible to shrink the energy-consuming OXU without incurring performance degradation. As a result, FXA achieves both a high performance and low energy consumption. We evaluated FXA compared with conventional out-of-order/in-order superscalar processors after ARM big. LITTLE architecture. The results show that FXA achieves performance improvements of 67% at the maximum and 7.4% on geometric mean in SPECCPU INT 2006 benchmark suite relative to a conventional superscalar processor (big), while reducing the energy consumption by 86% at the issue queue and 17% in the whole processor. The performance/energy ratio (the inverse of the energy-delay product) of FXA is 25% higher than that of a conventional superscalar processor (big) and 27% higher than that of a conventional in-order superscalar processor (LITTLE).","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"40 1","pages":"419-431"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81741161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stavros Volos, Javier Picorel, B. Falsafi, Boris Grot
With the end of Dennard scaling, server power has emerged as the limiting factor in the quest for more capable datacenters. Without the benefit of supply voltage scaling, it is essential to lower the energy per operation to improve server efficiency. As the industry moves to lean-core server processors, the energy bottleneck is shifting toward main memory as a chief source of server energy consumption in modern datacenters. Maximizing the energy efficiency of today's DRAM chips and interfaces requires amortizing the costly DRAM page activations over multiple row buffer accesses. This work introduces Bulk Memory Access Prediction and Streaming, or BuMP. We make the observation that a significant fraction (59-79%) of all memory accesses fall into DRAM pages with high access density, meaning that the majority of their cache blocks will be accessed within a modest time frame of the first access. Accesses to high-density DRAM pages include not only memory reads in response to load instructions, but also reads stemming from store instructions as well as memory writes upon a dirty LLC eviction. The remaining accesses go to low-density pages and virtually unpredictable reference patterns (e.g., hashed key lookups). BuMP employs a low-cost predictor to identify high-density pages and triggers bulk transfer operations upon the first read or write to the page. In doing so, BuMP enforces high row buffer locality where it is profitable, thereby reducing DRAM energy per access by 23% and improving server throughput by 11% across a wide range of server applications.
{"title":"BuMP: Bulk Memory Access Prediction and Streaming","authors":"Stavros Volos, Javier Picorel, B. Falsafi, Boris Grot","doi":"10.1109/MICRO.2014.44","DOIUrl":"https://doi.org/10.1109/MICRO.2014.44","url":null,"abstract":"With the end of Den nard scaling, server power has emerged as the limiting factor in the quest for more capable data enters. Without the benefit of supply voltage scaling, it is essential to lower the energy per operation to improve server efficiency. As the industry moves to lean-core server processors, the energy bottleneck is shifting toward main memory as a chief source of server energy consumption in modern data enters. Maximizing the energy efficiency of today's DRAM chips and interfaces requires amortizing the costly DRAM page activations over multiple row buffer accesses. This work introduces Bulk Memory Access Prediction and Streaming, or BuMP. We make the observation that a significant fraction (59-79%) of all memory accesses fall into DRAM pages with high access density, meaning that the majority of their cache blocks will be accessed within a modest time frame of the first access. Accesses to high-density DRAM pages include not only memory reads in response to load instructions, but also reads stemming from store instructions as well as memory writes upon a dirty LLC eviction. The remaining accesses go to low-density pages and virtually unpredictable reference patterns (e.g., Hashed key lookups). BuMP employs a low-cost predictor to identify high-density pages and triggers bulk transfer operations upon the first read or write to the page. In doing so, BuMP enforces high row buffer locality where it is profitable, thereby reducing DRAM energy per access by 23%, and improves server throughput by 11% across a wide range of server applications.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"73 1","pages":"545-557"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80448995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Onur Kayiran, N. Nachiappan, Adwait Jog, Rachata Ausavarungnirun, M. Kandemir, G. Loh, O. Mutlu, C. Das
Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.
{"title":"Managing GPU Concurrency in Heterogeneous Architectures","authors":"Onur Kayiran, N. Nachiappan, Adwait Jog, Rachata Ausavarungnirun, M. Kandemir, G. Loh, O. Mutlu, C. Das","doi":"10.1109/MICRO.2014.62","DOIUrl":"https://doi.org/10.1109/MICRO.2014.62","url":null,"abstract":"Heterogeneous architectures consisting of general-purpose CPUs and throughput-optimized GPUs are projected to be the dominant computing platforms for many classes of applications. The design of such systems is more complex than that of homogeneous architectures because maximizing resource utilization while minimizing shared resource interference between CPU and GPU applications is difficult. We show that GPU applications tend to monopolize the shared hardware resources, such as memory and network, because of their high thread-level parallelism (TLP), and discuss the limitations of existing GPU-based concurrency management techniques when employed in heterogeneous systems. To solve this problem, we propose an integrated concurrency management strategy that modulates the TLP in GPUs to control the performance of both CPU and GPU applications. This mechanism considers both GPU core state and system-wide memory and network congestion information to dynamically decide on the level of GPU concurrency to maximize system performance. We propose and evaluate two schemes: one (CM-CPU) for boosting CPU performance in the presence of GPU interference, the other (CM-BAL) for improving both CPU and GPU performance in a balanced manner and thus overall system performance. Our evaluations show that the first scheme improves average CPU performance by 24%, while reducing average GPU performance by 11%. The second scheme provides 7% average performance improvement for both CPU and GPU applications. We also show that our solution allows the user to control performance trade-offs between CPUs and GPUs.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"74 1","pages":"114-126"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91341243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-voltage computing is emerging as a promising energy-efficient solution for power-constrained environments. Unfortunately, low-voltage operation presents significant reliability challenges, including increased sensitivity to static and dynamic variability. To prevent errors, safety guard bands can be added to the supply voltage. While these guard bands are feasible at higher supply voltages, they are prohibitively expensive at low voltages, to the point of negating most of the energy savings. Voltage speculation techniques have been proposed to dynamically reduce voltage margins. Most require additional hardware to be added to the chip to correct or prevent timing errors caused by excessively aggressive speculation. This paper presents a mechanism for safely guiding voltage speculation using direct feedback from ECC-protected cache lines. We conduct extensive testing of an Intel Itanium processor running at low voltages. We find that as voltage margins are reduced, certain ECC-protected cache lines consistently exhibit correctable errors. We propose a hardware mechanism for continuously probing these cache lines to fine-tune the supply voltage at core granularity within a chip. Moreover, we demonstrate that this mechanism is sufficiently sensitive to detect and adapt to voltage noise caused by fluctuations in chip activity. We evaluate a proof-of-concept implementation of this mechanism in an Itanium-based server. We show that this solution lowers supply voltage by 18% on average, reducing power consumption by an average of 33% while running a mix of benchmark applications.
{"title":"Using ECC Feedback to Guide Voltage Speculation in Low-Voltage Processors","authors":"Anys Bacha, R. Teodorescu","doi":"10.1109/MICRO.2014.54","DOIUrl":"https://doi.org/10.1109/MICRO.2014.54","url":null,"abstract":"Low-voltage computing is emerging as a promising energy-efficient solution to power-constrained environments. Unfortunately, low-voltage operation presents significant reliability challenges, including increased sensitivity to static and dynamic variability. To prevent errors, safety guard bands can be added to the supply voltage. While these guard bands are feasible at higher supply voltages, they are prohibitively expensive at low voltages, to the point of negating most of the energy savings. Voltage speculation techniques have been proposed to dynamically reduce voltage margins. Most require additional hardware to be added to the chip to correct or prevent timing errors caused by excessively aggressive speculation. This paper presents a mechanism for safely guiding voltage speculation using direct feedback from ECC-protected cache lines. We conduct extensive testing of an Intel Itanium processor running at low voltages. We find that as voltage margins are reduced, certain ECC-protected cache lines consistently exhibit correctable errors. We propose a hardware mechanism for continuously probing these cache lines to fine tune supply voltage at core granularity within a chip. Moreover, we demonstrate that this mechanism is sufficiently sensitive to detect and adapt to voltage noise caused by fluctuations in chip activity. We evaluate a proof-of-concept implementation of this mechanism in an Itanium-based server. We show that this solution lowers supply voltage by 18% on average, reducing power consumption by an average of 33% while running a mix of benchmark applications.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"20 1","pages":"306-318"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81495383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linhai Song, Min Feng, N. Ravi, Yi Yang, S. Chakradhar
Applications executing on multicore processors can now easily offload computations to manycore processors, such as Intel Xeon Phi coprocessors. However, it requires high levels of expertise and effort to tune such offloaded applications to realize high-performance execution. Previous efforts have focused on optimizing the execution of offloaded computations on manycore processors. However, we observe that the data transfer overhead between multicore and manycore processors, and the limited device memories of manycore processors, often constrain the performance gains that are possible by offloading computations. In this paper, we present three source-to-source compiler optimizations that can significantly improve the performance of applications that offload computations to manycore processors. The first optimization automatically transforms offloaded code to enable data streaming, which overlaps data transfer between multicore and manycore processors with computation on these processors to hide data transfer overhead. This optimization is also designed to minimize memory usage on manycore processors while achieving optimal performance. The second compiler optimization reorders computations to regularize irregular memory accesses. It enables data streaming and factorization on manycore processors even when the memory access patterns in the original source code are irregular. Finally, our new shared memory mechanism provides efficient support for transferring large pointer-based data structures between hosts and manycore processors. Our evaluation shows that the proposed compiler optimizations benefit 9 out of 12 benchmarks. Compared with simply offloading the original parallel implementations of these benchmarks, we achieve 1.16x-52.21x speedups.
{"title":"COMP: Compiler Optimizations for Manycore Processors","authors":"Linhai Song, Min Feng, N. Ravi, Yi Yang, S. Chakradhar","doi":"10.1109/MICRO.2014.30","DOIUrl":"https://doi.org/10.1109/MICRO.2014.30","url":null,"abstract":"Applications executing on multicore processors can now easily offload computations to many core processors, such as Intel Xeon Phi coprocessors. However, it requires high levels of expertise and effort to tune such offloaded applications to realize high-performance execution. Previous efforts have focused on optimizing the execution of offloaded computations on many core processors. However, we observe that the data transfer overhead between multicore and many core processors, and the limited device memories of many core processors often constrain the performance gains that are possible by offloading computations. In this paper, we present three source-to-source compiler optimizations that can significantly improve the performance of applications that offload computations to many core processors. The first optimization automatically transforms offloaded codes to enable data streaming, which overlaps data transfer between multicore and many core processors with computations on these processors to hide data transfer overhead. This optimization is also designed to minimize the memory usage on many core processors, while achieving the optimal performance. The second compiler optimization re-orders computations to regularize irregular memory accesses. It enables data streaming and factorization on many core processors, even when the memory access patterns in the original source codes are irregular. Finally, our new shared memory mechanism provides efficient support for transferring large pointer-based data structures between hosts and many core processors. Our evaluation shows that the proposed compiler optimizations benefit 9 out of 12 benchmarks. Compared with simply offloading the original parallel implementations of these benchmarks, we can achieve 1.16x-52.21x speedups.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"52 1","pages":"659-671"},"PeriodicalIF":0.0,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87058466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}