PrORAM: Dynamic prefetcher for Oblivious RAM
Xiangyao Yu, Syed Kamran Haider, Ling Ren, Christopher W. Fletcher, Albert Kwon, Marten van Dijk, S. Devadas
Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead. Although traditional data prefetching techniques successfully hide memory latency in DRAM-based systems, they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of ORAM's internal structures. While it might seem that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime and exploits it without leaking any information about the access pattern. Our simulation results show that PrORAM significantly improves ORAM performance: it achieves an average performance gain of 20% over the baseline ORAM for memory-intensive Splash2 benchmarks and 5.5% for SPEC06 workloads. The performance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5%, respectively. On average, PrORAM offers twice the performance gain of a static super block scheme.
{"title":"PrORAM: Dynamic prefetcher for Oblivious RAM","authors":"Xiangyao Yu, Syed Kamran Haider, Ling Ren, Christopher W. Fletcher, Albert Kwon, Marten van Dijk, S. Devadas","doi":"10.1145/2749469.2750413","DOIUrl":"https://doi.org/10.1145/2749469.2750413","url":null,"abstract":"Oblivious RAM (ORAM) is an established technique to hide the access pattern to an untrusted storage system. With ORAM, a curious adversary cannot tell what address the user is accessing when observing the bits moving between the user and the storage system. All existing ORAM schemes achieve obliviousness by adding redundancy to the storage system, i.e., each access is turned into multiple random accesses. Such redundancy incurs a large performance overhead.Although traditional data prefetching techniques successfully hide memory latency in DRAM based systems, it turns out that they do not work well for ORAM because ORAM does not have enough memory bandwidth available for issuing prefetch requests. In this paper, we exploit ORAM locality by taking advantage of the ORAM internal structures. While it might seem apparent that obliviousness and locality are two contradictory concepts, we challenge this intuition by exploiting data locality in ORAM without sacrificing security. In particular, we propose a dynamic ORAM prefetching technique called PrORAM (Dynamic Prefetcher for ORAM) and comprehensively explore its design space. PrORAM detects data locality in programs at runtime, and exploits the locality without leaking any information on the access pattern. Our simulation results show that with PrORAM, the performance of ORA M can be significantly improved. PrORAM achieves an average performance gain of 20% over the baseline ORA M for memory intensive benchmarks among Splash2 and 5.5% for SP EC06 workloads. The peiformance gain for YCSB and TPCC in DBMS benchmarks is 23.6% and 5% respectively. On average, PrORAM offers twice the performance gain than that offered by a static super block scheme.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"42 1","pages":"616-628"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89572583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MiSAR: Minimalistic synchronization accelerator with resource overflow management
Ching-Kai Liang, Milos Prvulović
While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on a single type of synchronization (barrier or lock), so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables. This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43× over the software (pthreads) implementation.
{"title":"MiSAR: Minimalistic synchronization accelerator with resource overflow management","authors":"Ching-Kai Liang, Milos Prvulović","doi":"10.1145/2749469.2750396","DOIUrl":"https://doi.org/10.1145/2749469.2750396","url":null,"abstract":"While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on one type (barrier or lock) of synchronization, so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables. This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between using hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently-active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43X over the software (pthreads) implementation.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"2 1","pages":"414-426"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86551915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HEB: Deploying and managing hybrid energy buffers for improving datacenter efficiency and economy
Longjun Liu, Chao Li, Hongbin Sun, Yang Hu, Juncheng Gu, Tao Li, J. Xin, Nanning Zheng
Today, an increasing number of applications and services are hosted by large-scale data centers. Massive and irregular load surges challenge data center power infrastructures. As a result, power mismatch between supply and demand has emerged as a crucial issue in modern data centers, which are either under-provisioned or powered by intermittent power sources. Recent proposals have employed energy storage devices such as uninterruptible power supply (UPS) systems to address this issue. However, current approaches lack the capacity to handle irregular and unpredictable power mismatches efficiently. In this paper, we propose Hybrid Energy Buffering (HEB), the first heterogeneous and adaptive strategy that incorporates super-capacitors (SCs) into existing data centers to dynamically deal with power mismatches. Our techniques exploit diverse energy-absorbing characteristics and intelligent load assignment policies to provide efficiency- and scenario-aware power mismatch management. More attractively, our management schemes make costly energy storage devices more affordable and economical for datacenter-scale use. We evaluate the HEB design with a real system prototype. Compared with a homogeneous battery energy buffering system, HEB improves energy efficiency by 39.7%, extends UPS lifetime by 4.7×, reduces system downtime by 41%, and improves renewable energy utilization by 81.2%. Our TCO analysis shows that HEB manifests high ROI and gains more than 1.9× peak-shaving benefit over an 8-year period. It allows datacenters to adapt to various power supply anomalies, thereby improving operational efficiency, resiliency, and economy.
{"title":"HEB: Deploying and managing hybrid energy buffers for improving datacenter efficiency and economy","authors":"Longjun Liu, Chao Li, Hongbin Sun, Yang Hu, Juncheng Gu, Tao Li, J. Xin, Nanning Zheng","doi":"10.1145/2749469.2750384","DOIUrl":"https://doi.org/10.1145/2749469.2750384","url":null,"abstract":"Today, an increasing number of applications and services are being hosted by large-scale data centers. The massive and irregular load surges challenge data center power infrastructures. As a result, power mismatching between supply and demand has emerged as a crucial issue in modern data centers which are either under-provisioned or powered by intermittent power sources. Recent proposals have employed energy storage devices such as the uninterruptible power supply (UPS) systems to address this issue. However, current approaches lack the capacity of efficiently handling the irregular and unpredictable power mismatches. In this paper, we propose Hybrid Energy Buffering (HEB), the first heterogeneous and adaptive strategy that incorporates super-capacitors (SCs) into existing data centers to dynamically deal with power mismatches. Our techniques exploit diverse energy absorbing characteristics and intelligent load assignment policies to provide efficiency-and scenario- aware power mismatch management. More attractively, our management schemes make the costly energy storage devices more affordable and economical for datacenter-scale usage. We evaluate the HEB design with a real system prototype. Compared with a homogenous battery energy buffering system, HEB could improve energy efficiency by 39.7%, extend UPS lifetime by 4.7×, reduce system downtime by 41% and improve renewable energy utilization by 81.2%. Our TCO analysis shows that HEB manifests high ROI and is able to gain more than 1.9× peak shaving benefit during an 8-years period. It allows datacenters to adapt to various power supply anomalies, thereby improving operational efficiency, resiliency and economy.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"29 1","pages":"463-475"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80503887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hi-fi playback: Tolerating position errors in shift operations of racetrack memory
Chao Zhang, Guangyu Sun, Xian Zhang, Weiqi Zhang, Weisheng Zhao, Tao Wang, Yun Liang, Yongpan Liu, Yu Wang, J. Shu
Racetrack memory is an emerging non-volatile memory based on spintronic domain wall technology. It can achieve ultra-high storage density, and its read/write speed is comparable to that of SRAM. Due to the tape-like structure of its storage cell, a "shift" operation is introduced to access racetrack memory. Prior research has therefore focused mainly on minimizing the shift latency/energy of racetrack memory while leveraging its ultra-high storage density. The reliability of the shift operation, however, is not well addressed. In fact, racetrack memory suffers from unsuccessful shifts due to domain misalignment, a problem we call "position error" in this work. Position errors can reduce the mean-time-to-failure (MTTF) of racetrack memory to an intolerable level. Even worse, conventional error correction codes (ECCs), which are designed for "bit errors", cannot protect racetrack memory from them. In this work, we investigate the position error model of a shift operation and categorize position errors into two types: "stop-in-middle" errors and "out-of-step" errors. To eliminate stop-in-middle errors, we propose a technique called sub-threshold shift (STS) that performs a more reliable shift in two stages. To detect and recover from out-of-step errors, we propose a protection mechanism called position error correction code (p-ECC). We first describe how to design a p-ECC for different protection strengths and analyze the corresponding design overhead. We then show how to reduce the area cost of p-ECC by leveraging the "overhead region" in a racetrack memory stripe. With these protection mechanisms, we introduce a position-error-aware shift architecture. Experimental results demonstrate that, after applying our techniques, the overall MTTF of racetrack memory improves from 1.33 μs to more than 69 years, with only 0.2% performance degradation. Trade-offs among reliability, area, performance, and energy are also explored with comprehensive discussion.
{"title":"Hi-fi playback: Tolerating position errors in shift operations of racetrack memory","authors":"Chao Zhang, Guangyu Sun, Xian Zhang, Weiqi Zhang, Weisheng Zhao, Tao Wang, Yun Liang, Yongpan Liu, Yu Wang, J. Shu","doi":"10.1145/2749469.2750388","DOIUrl":"https://doi.org/10.1145/2749469.2750388","url":null,"abstract":"Racetrack memory is an emerging non-volatile memory based on spintronic domain wall technology. It can achieve ultra-high storage density. Also, its read/write speed is comparable to that of SRAM. Due to the tape-like structure of its storage cell, a “shift” operation is introduced to access racetrack memory. Thus, prior research mainly focused on minimizing shift latency/energy of racetrack memory while leveraging its ultra-high storage density. Yet the reliability issue of a shift operation, however, is not well addressed. In fact, racetrack memory suffers from unsuccessful shift due to domain misalignment. Such a problem is called “position error” in this work. It can significantly reduce mean-time-to-failure (MTTF) of racetrack memory to an intolerable level. Even worse, conventional error correction codes (ECCs), which are designed for “bit errors”, cannot protect racetrack memory from the position errors. In this work, we investigate the position error model of a shift operation and categorize position errors into two types: “stop-in-middle” error and “out-of-step” error. To eliminate the stop-in-middle error, we propose a technique called sub-threshold shift (STS) to perform a more reliable shift in two stages. To detect and recover the out-of-step error, a protection mechanism called position error correction code (p-ECC) is proposed. We first describe how to design a p-ECC for different protection strength and analyze corresponding design overhead. Then, we further propose how to reduce area cost of p-ECC by leveraging the “overhead region” in a racetrack memory stripe. With these protection mechanisms, we introduce a position-error-aware shift architecture. Experimental results demonstrate that, after using our techniques, the overall MTTF of racetrack memory is improved from 1.33μs to more than 69 years, with only 0.2% performance degradation. Trade-off among reliability, area, performance, and energy is also explored with comprehensive discussion.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"7 1","pages":"694-706"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79047680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers
Johann Hauswald, Yiping Kang, M. Laurenzano, Quan Chen, Cheng Li, T. Mudge, R. Dreslinski, Jason Mars, Lingjia Tang
As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are adopting large deep neural networks (DNNs) for machine learning challenges such as image processing, speech recognition, and natural language processing. A number of open questions arise as to the design of a server platform specialized for DNNs and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high-throughput DNN system based on massive GPU server designs and provide insights into the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures, including the total cost of ownership implications of a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.
{"title":"DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers","authors":"Johann Hauswald, Yiping Kang, M. Laurenzano, Quan Chen, Cheng Li, T. Mudge, R. Dreslinski, Jason Mars, Lingjia Tang","doi":"10.1145/2749469.2749472","DOIUrl":"https://doi.org/10.1145/2749469.2749472","url":null,"abstract":"As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, webservice companies are adopting large deep neural networks (DNN) for machine learning challenges such as image processing, speech recognition, natural language processing, among others. A number of open questions arise as to the design of a server platform specialized for DNN and how modern warehouse scale computers (WSCs) should be outfitted to provide DNN as a service for these applications. In this paper, we present DjiNN, an open infrastructure for DNN as a service in WSCs, and Tonic Suite, a suite of 7 end-to-end applications that span image, speech, and language processing. We use DjiNN to design a high throughput DNN system based on massive GPU server designs and provide insights as to the varying characteristics across applications. After studying the throughput, bandwidth, and power properties of DjiNN and Tonic Suite, we investigate several design points for future WSC architectures. We investigate the total cost of ownership implications of having a WSC with a disaggregated GPU pool versus a WSC composed of homogeneous integrated GPU servers. We improve DNN throughput by over 120× for all but one application (40× for Facial Recognition) on an NVIDIA K40 GPU. On a GPU server composed of 8 NVIDIA K40s, we achieve near-linear scaling (around 1000× throughput improvement) for 3 of the 7 applications. Through our analysis, we also find that GPU-enabled WSCs improve total cost of ownership over CPU-only designs by 4-20×, depending on the composition of the workload.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"74 1","pages":"27-40"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77111629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing world switches in virtualized environment with flexible cross-world calls
Wenhao Li, Yubin Xia, Haibo Chen, B. Zang, Haibing Guan
Modern computers are built with increasingly complex software stacks crossing multiple layers (i.e., worlds), where cross-world calls have become a necessity for various important purposes like security, reliability, and reduced complexity. Unfortunately, cross-world call support is currently limited (e.g., syscall, vmcall), so other calls must be emulated by detouring multiple times through the privileged software layer (i.e., the OS kernel and hypervisor). This causes not only significant performance degradation but also unnecessary implementation complexity. This paper argues that it is time to rethink the design of traditional cross-world call mechanisms by reviewing existing systems built upon hypervisors. Following the design philosophy of separating authentication from authorization, this paper advocates decoupling the authorization of whether a world call is permitted (done by software) from the unforgeable identification of calling peers (done by hardware). The result is a flexible cross-world call scheme (namely CrossOver) that allows secure, efficient, and flexible cross-world calls across multiple layers, not only within the same address space but also across multiple address spaces. We demonstrate that CrossOver can be approximated using an existing hardware mechanism (namely VMFUNC) and that a trivial modification of the VMFUNC mechanism can provide full support for CrossOver. To show its usefulness, we have conducted case studies using several recent systems such as Proxos, Hyper-Shell, Tahoma, and ShadowContext. Performance measurements using full-system emulation and a real processor with VMFUNC show that CrossOver significantly boosts the performance of the mentioned systems.
{"title":"Reducing world switches in virtualized environment with flexible cross-world calls","authors":"Wenhao Li, Yubin Xia, Haibo Chen, B. Zang, Haibing Guan","doi":"10.1145/2749469.2750406","DOIUrl":"https://doi.org/10.1145/2749469.2750406","url":null,"abstract":"Modern computers are built with increasingly complex software stack crossing multiple layers (i.e., worlds), where cross-world call has been a necessity for various important purposes like security, reliability, and reduced complexity. Unfortunately, there is currently limited cross-world call support (e.g., syscall, vmcall), and thus other calls need to be emulated by detouring multiple times to the privileged software layer (i.e., OS kernel and hypervisor). This causes not only significant performance degradation, but also unnecessary implementation complexity. This paper argues that it is time to rethink the design of traditional cross-world call mechanisms by reviewing existing systems built upon hypervisors. Following the design philosophy of separating authentication from authorization, this paper advocates decoupling of the authorization on whether a world call is permitted (by software) from unforgeable identification of calling peers (by hardware). This results in a flexible cross-world call scheme (namely CrossOver) that allows secure, efficient and flexible cross-world calls across multiple layers not only within the same address space, but also across multiple address spaces. We demonstrate that CrossOver can be approximated by using existing hardware mechanism (namely VMFUNC) and a trivial modification of the VMFUNC mechanism can provide a full support of CrossOver. To show its usefulness, we have conducted case studies by using several recent systems such as Proxos, Hyper-Shell, Tahoma and ShadowContext. Performance measurements using full-system emulation and a real processor with VMFUNC shows that CrossOver significantly boosts the performance of the mentioned systems.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"68 1","pages":"375-387"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81257710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShiDianNao: Shifting vision processing closer to the sensor
Zidong Du, Robert Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, O. Temam
In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and performance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The state-of-the-art neural networks for these applications are Convolutional Neural Networks (CNNs), and they have an important property: weights are shared among many neurons, considerably reducing the neural network's memory footprint. This property makes it possible to map a CNN entirely within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses, combined with careful exploitation of the specific data access patterns within CNNs, allows us to design an accelerator that is 60× more energy efficient than the previous state-of-the-art neural network accelerator. We present a full design down to the layout at 65 nm, with a modest footprint of 4.86 mm² and a power consumption of only 320 mW, yet still about 30× faster than high-end GPUs.
{"title":"ShiDianNao: Shifting vision processing closer to the sensor","authors":"Zidong Du, Robert Fasthuber, Tianshi Chen, P. Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, O. Temam","doi":"10.1145/2749469.2750389","DOIUrl":"https://doi.org/10.1145/2749469.2750389","url":null,"abstract":"In recent years, neural network accelerators have been shown to achieve both high energy efficiency and high performance for a broad application scope within the important category of recognition and mining applications. Still, both the energy efficiency and peiformance of such accelerators remain limited by memory accesses. In this paper, we focus on image applications, arguably the most important category among recognition and mining applications. The neural networks which are state-of-the-art for these applications are Convolutional Neural Networks (CNN), and they have an important property: weights are shared among many neurons, considerably reducing the neural network memory footprint. This property allows to entirely map a CNN within an SRAM, eliminating all DRAM accesses for weights. By further hoisting this accelerator next to the image sensor, it is possible to eliminate all remaining DRAM accesses, i.e., for inputs and outputs. In this paper, we propose such a CNN accelerator, placed next to a CMOS or CCD sensor. The absence of DRAM accesses combined with a careful exploitation of the specific data access patterns within CNNs allows us to design an accelerator which is 60x more energy efficient than the previous state-of-the-art neural network accelerator. We present a fult design down to the layout at 65 nm, with a modest footprint of 4.86 mm2 and consuming only 320 mW, but still about 30x faster than high-end GPUs.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"70 1","pages":"92-104"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74395640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Load Slice Core microarchitecture
Trevor E. Carlson, W. Heirman, O. Allam, S. Kaxiras, L. Eeckhout
Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.
{"title":"The Load Slice Core microarchitecture","authors":"Trevor E. Carlson, W. Heirman, O. Allam, S. Kaxiras, L. Eeckhout","doi":"10.1145/2749469.2750407","DOIUrl":"https://doi.org/10.1145/2749469.2750407","url":null,"abstract":"Driven by the motivation to expose instruction-level parallelism (ILP), microprocessor cores have evolved from simple, in-order pipelines into complex, superscalar out-of-order designs. By extracting ILP, these processors also enable parallel cache and memory operations as a useful side-effect. Today, however, the growing off-chip memory wall and complex cache hierarchies of many-core processors make cache and memory accesses ever more costly. This increases the importance of extracting memory hierarchy parallelism (MHP), while reducing the net impact of more general, yet complex and power-hungry ILP-extraction techniques. In addition, for multi-core processors operating in power- and energy-constrained environments, energy-efficiency has largely replaced single-thread performance as the primary concern. Based on this observation, we propose a core microarchitecture that is aimed squarely at generating parallel accesses to the memory hierarchy while maximizing energy efficiency. The Load Slice Core extends the efficient in-order, stall-on-use core with a second in-order pipeline that enables memory accesses and address-generating instructions to bypass stalled instructions in the main pipeline. Backward program slices containing address-generating instructions leading up to loads and stores are extracted automatically by the hardware, using a novel iterative algorithm that requires no software support or recompilation. On average, the Load Slice Core improves performance over a baseline in-order processor by 53% with overheads of only 15% in area and 22% in power, leading to an increase in energy efficiency (MIPS/Watt) over in-order and out-of-order designs by 43% and over 4.7×, respectively. In addition, for a power- and area-constrained many-core design, the Load Slice Core outperforms both in-order and out-of-order designs, achieving a 53% and 95% higher performance, respectively, thus providing an alternative direction for future many-core processors.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"35 1","pages":"272-284"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85625924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CloudMonatt: An architecture for security health monitoring and attestation of virtual machines in cloud computing
Tianwei Zhang, R. Lee
Cloud customers need guarantees regarding the security of their virtual machines (VMs) operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where their VM is executing, and by the semantic gap between what the customer wants to know and what can be measured in the cloud. We present an architecture for monitoring a VM's security health, with the ability to attest this to the customer in an unforgeable manner. We show a concrete implementation of property-based attestation and a full prototype based on the OpenStack open source cloud software.
{"title":"CloudMonatt: An architecture for security health monitoring and attestation of virtual machines in cloud computing","authors":"Tianwei Zhang, R. Lee","doi":"10.1145/2749469.2750422","DOIUrl":"https://doi.org/10.1145/2749469.2750422","url":null,"abstract":"Cloud customers need guarantees regarding the security of their virtual machines (VMs), operating within an Infrastructure as a Service (IaaS) cloud system. This is complicated by the customer not knowing where his VM is executing, and on the semantic gap between what the customer wants to know versus what can be measured in the cloud. We present an architecture for monitoring a VM's security health, with the ability to attest this to the customer in an unforgeable manner. We show a concrete implementation of property-based attestation and a full prototype based on the OpenStack open source cloud software.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"2015 1","pages":"362-374"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91316298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic locality and context-based prefetching using reinforcement learning
L. Peled, Shie Mannor, U. Weiser, Yoav Etsion
Most modern memory prefetchers rely on spatio-temporal locality to predict the memory addresses likely to be accessed by a program in the near future. Emerging workloads, however, make increasing use of irregular data structures, and thus exhibit a lower degree of spatial locality. This makes them less amenable to spatio-temporal prefetchers. In this paper, we introduce the concept of Semantic Locality, which uses inherent program semantics to characterize access relations. We show how, in principle, semantic locality can capture the relationship between data elements in a manner agnostic to the actual data layout, and we argue that semantic locality transcends spatio-temporal concerns. We further introduce the context-based memory prefetcher, which approximates semantic locality using reinforcement learning. The prefetcher identifies access patterns by applying reinforcement learning methods over machine and code attributes, that provide hints on memory access semantics. We test our prefetcher on a variety of benchmarks that employ both regular and irregular patterns. For the SPEC 2006 suite, it delivers speedups as high as 2.8× (20% on average) over a baseline with no prefetching, and outperforms leading spatio-temporal prefetchers. Finally, we show that the context-based prefetcher makes it possible for naive, pointer-based implementations of irregular algorithms to achieve performance comparable to that of spatially optimized code.
{"title":"Semantic locality and context-based prefetching using reinforcement learning","authors":"L. Peled, Shie Mannor, U. Weiser, Yoav Etsion","doi":"10.1145/2749469.2749473","DOIUrl":"https://doi.org/10.1145/2749469.2749473","url":null,"abstract":"Most modern memory prefetchers rely on spatio-temporal locality to predict the memory addresses likely to be accessed by a program in the near future. Emerging workloads, however, make increasing use of irregular data structures, and thus exhibit a lower degree of spatial locality. This makes them less amenable to spatio-temporal prefetchers. In this paper, we introduce the concept of Semantic Locality, which uses inherent program semantics to characterize access relations. We show how, in principle, semantic locality can capture the relationship between data elements in a manner agnostic to the actual data layout, and we argue that semantic locality transcends spatio-temporal concerns. We further introduce the context-based memory prefetcher, which approximates semantic locality using reinforcement learning. The prefetcher identifies access patterns by applying reinforcement learning methods over machine and code attributes, that provide hints on memory access semantics. We test our prefetcher on a variety of benchmarks that employ both regular and irregular patterns. For the SPEC 2006 suite, it delivers speedups as high as 2.8× (20% on average) over a baseline with no prefetching, and outperforms leading spatio-temporal prefetchers. Finally, we show that the context-based prefetcher makes it possible for naive, pointer-based implementations of irregular algorithms to achieve performance comparable to that of spatially optimized code.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"27 1","pages":"285-297"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86552348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}