
Proceedings of the 40th Annual International Symposium on Computer Architecture: latest publications

QuickRec: prototyping an Intel architecture extension for record and replay of multithreaded programs
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485977
Gilles A. Pokam, Klaus Danne, C. Pereira, R. Kassa, T. Kranich, Shiliang Hu, Justin Emile Gottschlich, N. Honarmand, Nathan Dautenhahn, Samuel T. King, J. Torrellas
There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
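For readers unfamiliar with record-and-replay logging, the C sketch below illustrates the general shape of the per-core memory-ordering log such a system records: a run of instructions (a "chunk") is closed whenever a conflicting access from another core is observed, and the ordering edge is logged. The chunk-based format, structure names, and field choices are assumptions made purely for illustration and are not taken from the QuickRec paper.

/* Hypothetical sketch of a per-core record-and-replay memory log.
 * A "chunk" is a run of instructions executed by one core; when a
 * conflicting access from another core is detected, the chunk is closed
 * and an ordering edge (source core, source chunk) is logged.
 * None of these names come from the QuickRec paper. */
#include <stdio.h>

struct rr_dep   { int src_core; unsigned src_chunk; }; /* happens-before edge */
struct rr_chunk { unsigned insn_count; struct rr_dep dep; };

#define MAX_CHUNKS 1024

struct rr_log {
    int core_id;
    unsigned n;
    struct rr_chunk chunks[MAX_CHUNKS];
};

/* Close the current chunk because a remote conflicting access was seen. */
static void log_chunk(struct rr_log *log, unsigned insn_count,
                      int src_core, unsigned src_chunk)
{
    if (log->n < MAX_CHUNKS) {
        log->chunks[log->n].insn_count     = insn_count;
        log->chunks[log->n].dep.src_core   = src_core;
        log->chunks[log->n].dep.src_chunk  = src_chunk;
        log->n++;
    }
}

int main(void)
{
    struct rr_log log = { .core_id = 0, .n = 0 };
    log_chunk(&log, 5000, 1, 42);  /* core 0 ran 5000 insns, then ordered after core 1's chunk 42 */
    log_chunk(&log, 120,  1, 43);
    printf("core %d recorded %u chunks\n", log.core_id, log.n);
    return 0;
}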
Citations: 30
SIMD divergence optimization through intra-warp compaction
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485954
A. S. Vaidya, A. Shayesteh, Dong Hyuk Woo, Roy Saharoy, M. Azimi
SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of this problem. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. In this paper, we will outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, the execution time is reduced by an average of 7% for today's GPUs or by 18% for future GPUs with a better provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
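As a rough illustration of basic cycle compression, the C sketch below models a 32-lane warp issued over four 8-lane execution cycles and skips any cycle whose lane group is entirely inactive in the divergence mask; the group width and the counting model are assumptions for illustration, not the paper's microarchitecture. Swizzled-cycle compression would additionally permute active lanes across groups before counting.

/* Illustrative model (not the paper's implementation) of basic cycle
 * compression: a 32-lane warp is executed over 4 cycles of 8 lanes each;
 * a cycle whose 8 lanes are all inactive in the divergence mask is skipped. */
#include <stdint.h>
#include <stdio.h>

#define WARP_LANES 32
#define GROUP      8            /* assumed lane-group width per execution cycle */

static int cycles_with_bcc(uint32_t active_mask)
{
    int cycles = 0;
    for (int g = 0; g < WARP_LANES / GROUP; g++) {
        uint32_t group_bits = (active_mask >> (g * GROUP)) & 0xFFu;
        if (group_bits != 0)    /* at least one active lane: the cycle must issue */
            cycles++;
    }
    return cycles;
}

int main(void)
{
    /* Only lanes 0..7 active after a divergent branch: 1 cycle instead of 4. */
    printf("%d\n", cycles_with_bcc(0x000000FFu));
    /* Fully active warp still needs all 4 cycles. */
    printf("%d\n", cycles_with_bcc(0xFFFFFFFFu));
    return 0;
}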
Citations: 31
ArchShield: architectural framework for assisting DRAM scaling by tolerating high error rates
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485929
Prashant J. Nair, Dae-Hyun Kim, Moinuddin K. Qureshi
DRAM scaling has been the prime driver for increasing the capacity of main memory system over the past three decades. Unfortunately, scaling DRAM to smaller technology nodes has become challenging due to the inherent difficulty in designing smaller geometries, coupled with the problems of device variation and leakage. Future DRAM devices are likely to experience significantly high error-rates. Techniques that can tolerate errors efficiently can enable DRAM to scale to smaller technology nodes. However, existing techniques such as row/column sparing and ECC become prohibitive at high error-rates. To develop cost-effective solutions for tolerating high error-rates, this paper advocates a cross-layer approach. Rather than hiding the faulty cell information within the DRAM chips, we expose it to the architectural level. We propose ArchShield, an architectural framework that employs runtime testing to identify faulty DRAM cells. ArchShield tolerates these faults using two components, a Fault Map that keeps information about faulty words in a cache line, and Selective Word-Level Replication (SWLR) that replicates faulty words for error resilience. Both Fault Map and SWLR are integrated in a reserved area in DRAM memory. Our evaluations with 8GB DRAM DIMM show that ArchShield can efficiently tolerate error-rates as high as 10^-4 (100x higher than ECC alone), causes less than 2% performance degradation, and still maintains 1-bit error tolerance against soft errors.
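The C sketch below gives a functional picture of the two components named in the abstract: a per-line fault map consulted on access, with faulty words served from a replicated copy. The data layout and field names are hypothetical and only illustrate the mechanism, not ArchShield's actual organization of the reserved DRAM region.

/* Minimal sketch, not ArchShield's actual layout: a fault map records which
 * 64-bit words of a cache line are known-faulty, and reads of a faulty word
 * are served from a replicated copy kept in a reserved region of DRAM. */
#include <stdint.h>
#include <stdio.h>

#define WORDS_PER_LINE 8

struct line_fault_info {
    uint8_t  faulty_word_mask;              /* bit i set => word i is faulty */
    uint64_t replica[WORDS_PER_LINE];       /* stand-in for the reserved replication area */
};

static uint64_t read_word(const uint64_t line[WORDS_PER_LINE],
                          const struct line_fault_info *fi, int word)
{
    if (fi->faulty_word_mask & (1u << word))  /* faulty cell: use the replicated copy */
        return fi->replica[word];
    return line[word];
}

int main(void)
{
    uint64_t line[WORDS_PER_LINE] = {0};
    struct line_fault_info fi = { .faulty_word_mask = 1u << 3 };
    fi.replica[3] = 0xDEADBEEFull;            /* value previously written through SWLR */
    line[3] = 0x0BAD0BADull;                  /* whatever the weak cell happens to return */
    printf("%llx\n", (unsigned long long)read_word(line, &fi, 3));
    return 0;
}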
Citations: 170
Proceedings of the 40th Annual International Symposium on Computer Architecture
A. Mendelson
{"title":"Proceedings of the 40th Annual International Symposium on Computer Architecture","authors":"A. Mendelson","doi":"10.1145/2485922","DOIUrl":"https://doi.org/10.1145/2485922","url":null,"abstract":"","PeriodicalId":20555,"journal":{"name":"Proceedings of the 40th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2013-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89656868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Criticality stacks: identifying critical threads in parallel programs using synchronization behavior
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485966
Kristof Du Bois, Stijn Eyerman, Jennifer B. Sartor, L. Eeckhout
Analyzing multi-threaded programs is quite challenging, but is necessary to obtain good multicore performance while saving energy. Due to synchronization, certain threads make others wait, because they hold a lock or have yet to reach a barrier. We call these critical threads, i.e., threads whose performance is determinative of program performance as a whole. Identifying these threads can reveal numerous optimization opportunities, for the software developer and for hardware. In this paper, we propose a new metric for assessing thread criticality, which combines both how much time a thread is performing useful work and how many co-running threads are waiting. We show how thread criticality can be calculated online with modest hardware additions and with low overhead. We use our metric to create criticality stacks that break total execution time into each thread's criticality component, allowing for easy visual analysis of parallel imbalance. To validate our criticality metric, and demonstrate it is better than previous metrics, we scale the frequency of the most critical thread and show it achieves the largest performance improvement. We then demonstrate the broad applicability of criticality stacks by using them to perform three types of optimizations: (1) program analysis to remove parallel bottlenecks, (2) dynamically identifying the most critical thread and accelerating it using frequency scaling to improve performance, and (3) showing that accelerating only the most critical thread allows for targeted energy reduction.
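A minimal software sketch of the accounting the abstract describes: each interval between synchronization events is divided among the threads that were running, so time during which many co-runners wait is charged to the few threads doing useful work, and the per-thread totals sum to the program's execution time (which is what makes them a "stack"). The exact definition and the hardware support are in the paper; this shows only the general shape.

/* Sketch of a criticality accounting loop implied by the abstract (the
 * paper's exact definition and hardware support differ): whenever the set
 * of running threads changes, the elapsed interval is divided among the
 * threads that were running during it. */
#include <stdio.h>

#define NTHREADS 4

static double criticality[NTHREADS];

/* Charge one interval of 'len' time units to the threads that were running. */
static void account_interval(const int running[NTHREADS], double len)
{
    int nrun = 0;
    for (int i = 0; i < NTHREADS; i++)
        nrun += running[i];
    if (nrun == 0)
        return;
    for (int i = 0; i < NTHREADS; i++)
        if (running[i])
            criticality[i] += len / nrun;
}

int main(void)
{
    int all_running[NTHREADS] = {1, 1, 1, 1};
    int only_t0[NTHREADS]     = {1, 0, 0, 0};   /* threads 1-3 wait on a lock held by thread 0 */
    account_interval(all_running, 40.0);
    account_interval(only_t0,     60.0);        /* charged entirely to thread 0 */
    for (int i = 0; i < NTHREADS; i++)
        printf("thread %d criticality = %.1f\n", i, criticality[i]);
    return 0;   /* totals: 70, 10, 10, 10 -- they sum to the 100 units of execution time */
}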
Citations: 79
Improving virtualization in the presence of software managed translation lookaside buffers
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485933
Xiaotao Chang, H. Franke, Y. Ge, Tao Liu, Kun Wang, J. Xenidis, Fei Chen, Yu Zhang
Virtualization has become an important technology that is used across many platforms, particularly servers, to increase utilization, multi-tenancy and security. Virtualization introduces additional overhead that often relates to memory management, interrupt handling and hypervisor mode switching. Among those, memory management and translation lookaside buffer (TLB) management have been shown to have a significant impact on the performance of systems. Two principal mechanisms for TLB management exist in today's systems, namely software and hardware managed TLBs. In this paper, we analyze and quantify the overhead of a pure software virtualization that is implemented over a software managed TLB. We then describe our design of hardware extensions to support virtualization in systems with software managed TLBs to remove the most dominant overheads. These extensions were implemented in the Power embedded A2 core, which is used in the PowerEN and in the Blue Gene/Q processors. They were used to implement a KVM port. We evaluate each of these hardware extensions to determine their overall contributions to performance and efficiency. Collectively these extensions demonstrate an average improvement of 232% over a pure software implementation.
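As background for why software-managed TLB virtualization is costly, the sketch below shows the two-dimensional translation a hypervisor must perform on a guest TLB miss, composing the guest's virtual-to-physical mapping with the hypervisor's own mapping before a real TLB entry can be installed. The helper functions are stand-ins, not real interfaces, and the paper's hardware extensions (not shown) are aimed at removing exactly this kind of software overhead.

/* Generic illustration (not the paper's hardware extension) of the work done
 * on a guest TLB miss with a software-managed TLB: compose guest-virtual ->
 * guest-physical with guest-physical -> host-physical, then install an entry. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

/* Hypothetical lookup helpers standing in for the guest's page walk and the
 * hypervisor's memory map; both are assumptions for this sketch. */
static int guest_translate(uint64_t gva, uint64_t *gpa) { *gpa = gva + 0x1000;   return 0; }
static int host_translate (uint64_t gpa, uint64_t *hpa) { *hpa = gpa + 0x100000; return 0; }

static int handle_guest_tlb_miss(uint64_t gva, uint64_t *tlb_entry_pa)
{
    uint64_t gpa, hpa;
    if (guest_translate(gva, &gpa))   /* first dimension: guest page tables */
        return -1;                    /* reflect the fault back to the guest */
    if (host_translate(gpa, &hpa))    /* second dimension: hypervisor mapping */
        return -1;
    *tlb_entry_pa = hpa & ~((1ull << PAGE_SHIFT) - 1);  /* install gva -> hpa */
    return 0;
}

int main(void)
{
    uint64_t pa;
    if (handle_guest_tlb_miss(0x7f0000, &pa) == 0)
        printf("install TLB entry: gva 0x7f0000 -> hpa 0x%llx\n", (unsigned long long)pa);
    return 0;
}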
Citations: 29
AC-DIMM: associative computing with STT-MRAM
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485939
Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, E. Friedman
With technology scaling, on-chip power dissipation and off-chip memory bandwidth have become significant performance bottlenecks in virtually all computer systems, from mobile devices to supercomputers. An effective way of improving performance in the face of bandwidth and power limitations is to rely on associative memory systems. Recent work on a PCM-based, associative TCAM accelerator shows that associative search capability can reduce both off-chip bandwidth demand and overall system energy. Unfortunately, previously proposed resistive TCAM accelerators have limited flexibility: only a restricted (albeit important) class of applications can benefit from a TCAM accelerator, and the implementation is confined to resistive memory technologies with a high dynamic range (RHigh/RLow), such as PCM. This work proposes AC-DIMM, a flexible, high-performance associative compute engine built on a DDR3-compatible memory module. AC-DIMM addresses the limited flexibility of previous resistive TCAM accelerators by combining two powerful capabilities---associative search and processing in memory. Generality is improved by augmenting a TCAM system with a set of integrated, user programmable microcontrollers that operate directly on search results, and by architecting the system such that key-value pairs can be co-located in the same TCAM row. A new, bit-serial TCAM array is proposed, which enables the system to be implemented using STT-MRAM. AC-DIMM achieves a 4.2X speedup and a 6.5X energy reduction over a conventional RAM-based system on a set of 13 evaluated applications.
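A functional C model of the ternary (care/don't-care) match a TCAM row performs, included only to make the associative-search primitive concrete; AC-DIMM realizes this match bit-serially inside the STT-MRAM module itself and across all rows at once, which a sequential software loop like this cannot capture.

/* Functional sketch of a ternary match: a row matches a search key when the
 * key agrees with the row's value on every bit the row cares about. */
#include <stdint.h>
#include <stdio.h>

struct tcam_row { uint64_t value; uint64_t care; };   /* bits with care=0 match anything */

static int tcam_search(const struct tcam_row *rows, int nrows, uint64_t key)
{
    for (int i = 0; i < nrows; i++)
        if (((key ^ rows[i].value) & rows[i].care) == 0)
            return i;                                  /* index of first matching row */
    return -1;
}

int main(void)
{
    struct tcam_row rows[] = {
        { 0x12340000, 0xFFFF0000 },   /* matches any key whose upper half is 0x1234 */
        { 0xABCD1234, 0xFFFFFFFF },   /* exact match only */
    };
    printf("%d\n", tcam_search(rows, 2, 0x1234BEEF));  /* -> 0 */
    printf("%d\n", tcam_search(rows, 2, 0xABCD1234));  /* -> 1 */
    return 0;
}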
Citations: 132
Secure I/O device sharing among virtual machines on multiple hosts
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485932
Cheng-Chun Tu, Chao-Tang Lee, T. Chiueh
Virtualization allows flexible mappings between physical resources and virtual entities, and improves allocation efficiency and agility. Unfortunately, most existing virtualization technologies are limited to resources in a single host. This paper presents the design, implementation and evaluation of a multi-host I/O device virtualization system called Ladon, which enables I/O devices to be shared among virtual machines running on multiple hosts in a secure and efficient way. Specifically, Ladon uses a PCIe network to connect multiple servers with PCIe devices and allows VMs running on these servers to directly interact with these PCIe devices without interfering with one another. Through an evaluation of a fully operational Ladon prototype, we show that there is no throughput and latency penalty of the multi-host I/O virtualization enabled by Ladon compared to those of the existing single-host I/O virtualization technology.
Citations: 22
LINQits: big data on little clients
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485945
Eric S. Chung, John D. Davis, Jaewon Lee
We present LINQits, a flexible hardware template that can be mapped onto programmable logic or ASICs in a heterogeneous system-on-chip for a mobile device or server. Unlike fixed-function accelerators, LINQits accelerates a domain-specific query language called LINQ. LINQits does not provide coverage for all possible applications---however, existing applications (re-)written with LINQ in mind benefit extensively from hardware acceleration. Furthermore, the LINQits framework offers a graceful and transparent migration path from software to hardware. LINQits is prototyped on a 2W heterogeneous SoC called the ZYNQ processor, which combines dual ARM A9 processors with an FPGA on a single die in 28nm silicon technology. Our physical measurements show that LINQits improves energy efficiency by 8.9 to 30.6 times and performance by 10.7 to 38.1 times compared to optimized, multithreaded C programs running on conventional ARM A9 processors.
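For context, LINQ expresses queries declaratively through operators such as Where, Select, and Sum; the C sketch below writes out one such filter-and-aggregate query as the loop a hardware template would offload. It illustrates the workload class only; LINQits itself accelerates LINQ programs, not C code, and the query shown is an invented example.

/* The kind of query LINQ expresses declaratively (roughly:
 * data.Where(x => x > threshold).Select(x => x * x).Sum()), written out as
 * the loop an accelerator would execute over the data set. */
#include <stdio.h>

static long long filtered_sum_of_squares(const int *data, int n, int threshold)
{
    long long sum = 0;
    for (int i = 0; i < n; i++)
        if (data[i] > threshold)                      /* Where  */
            sum += (long long)data[i] * data[i];      /* Select + Sum */
    return sum;
}

int main(void)
{
    int data[] = { 1, 5, 9, 2, 7 };
    printf("%lld\n", filtered_sum_of_squares(data, 5, 4));  /* 5*5 + 9*9 + 7*7 = 155 */
    return 0;
}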
Citations: 102
Orchestrated scheduling and prefetching for GPGPUs
Pub Date : 2013-06-23 DOI: 10.1145/2485922.2485951
Adwait Jog, Onur Kayiran, Asit K. Mishra, M. Kandemir, O. Mutlu, R. Iyer, C. Das
In this paper, we present techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies. We demonstrate that existing warp scheduling policies in GPGPU architectures are unable to effectively incorporate data prefetching. The main reason is that they schedule consecutive warps, which are likely to access nearby cache blocks and thus prefetch accurately for one another, back-to-back in consecutive cycles. This either 1) causes prefetches to be generated by a warp too close to the time their corresponding addresses are actually demanded by another warp, or 2) requires sophisticated prefetcher designs to correctly predict the addresses required by a future "far-ahead" warp while executing the current warp. We propose a new prefetch-aware warp scheduling policy that overcomes these problems. The key idea is to separate in time the scheduling of consecutive warps such that they are not executed back-to-back. We show that this policy not only enables a simple prefetcher to be effective in tolerating memory latencies but also improves memory bank parallelism, even when prefetching is not employed. Experimental evaluations across a diverse set of applications on a 30-core simulated GPGPU platform demonstrate that the prefetch-aware warp scheduler provides 25% and 7% average performance improvement over baselines that employ prefetching in conjunction with, respectively, the commonly-employed round-robin scheduler or the recently-proposed two-level warp scheduler. Moreover, when prefetching is not employed, the prefetch-aware warp scheduler provides higher performance than both of these baseline schedulers as it better exploits memory bank parallelism.
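The sketch below illustrates one simple way to realize the grouping idea in the abstract: consecutive warps, which tend to access neighboring cache blocks, are assigned to different fetch groups rather than the same one, so a prefetch triggered while one group executes has time to complete before the consecutive warp in a later group needs the data. The group count and assignment function are illustrative choices, not the policy evaluated in the paper.

/* Two warp-to-fetch-group assignments: the conventional one places
 * consecutive warps in the same group (back-to-back execution), while the
 * prefetch-aware one spreads consecutive warps across groups so they are
 * separated in time. */
#include <stdio.h>

#define NUM_WARPS  32
#define NUM_GROUPS 4

static int group_conventional(int warp)   { return warp / (NUM_WARPS / NUM_GROUPS); }
static int group_prefetch_aware(int warp) { return warp % NUM_GROUPS; }

int main(void)
{
    /* Warps 0 and 1 are likely to access adjacent cache blocks. */
    printf("conventional:   warp 0 -> group %d, warp 1 -> group %d\n",
           group_conventional(0), group_conventional(1));     /* same group */
    printf("prefetch-aware: warp 0 -> group %d, warp 1 -> group %d\n",
           group_prefetch_aware(0), group_prefetch_aware(1));  /* different groups */
    return 0;
}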
Citations: 194