
Latest publications from the 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Record-Replay Architecture as a General Security Framework
Y. Shalabi, Mengjia Yan, N. Honarmand, R. Lee, J. Torrellas
Hardware security features need to strike a careful balance between design intrusiveness and completeness of methods. In addition, they need to be flexible, as security threats continuously evolve. To help address these requirements, this paper proposes a novel framework where Record and Deterministic Replay (RnR) is used to complement hardware security features. We call the framework RnR-Safe. RnR-Safe reduces the cost of security hardware by allowing it to be less precise at detecting attacks, potentially reporting false positives. This is because it relies on on-the-fly replay that transparently verifies whether the alarm is a real attack or a false positive. RnR-Safe uses two replayers: an always-on, fast Checkpoint replayer that periodically creates checkpoints, and a detailed-analysis Alarm replayer that is triggered when there is a threat alarm. As an example application, we use RnR-Safe to thwart Return Oriented Programming (ROP) attacks, including on the Linux kernel. Our design augments the Return Address Stack (RAS) with relatively inexpensive hardware. We evaluate RnR-Safe using a variety of workloads on virtual machines running Linux. We find that RnR-Safe is very effective. Thanks to the judicious RAS hardware extensions and hypervisor changes, the checkpointing replayer has an execution speed comparable to the recorded execution. Also, the alarm replayer needs to handle very few false positives.
{"title":"Record-Replay Architecture as a General Security Framework","authors":"Y. Shalabi, Mengjia Yan, N. Honarmand, R. Lee, J. Torrellas","doi":"10.1109/HPCA.2018.00025","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00025","url":null,"abstract":"Hardware security features need to strike a careful balance between design intrusiveness and completeness of methods. In addition, they need to be flexible, as security threats continuously evolve. To help address these requirements, this paper proposes a novel framework where Record and Deterministic Replay (RnR) is used to complement hardware security features. We call the framework RnR-Safe. RnR-Safe reduces the cost of security hardware by allowing it to be less precise at detecting attacks, potentially reporting false positives. This is because it relies on on-the-fly replay that transparently verifies whether the alarm is a real attack or a false positive. RnR-Safe uses two replayers: an always-on, fast Checkpoint replayer that periodically creates checkpoints, and a detailed-analysis Alarm replayer that is triggered when there is a threat alarm. As an example application, we use RnR-Safe to thwart Return Oriented Programming (ROP) attacks, including on the Linux kernel. Our design augments the Return Address Stack (RAS) with relatively inexpensive hardware. We evaluate RnR-Safe using a variety of workloads on virtual machines running Linux. We find that RnR-Safe is very effective. Thanks to the judicious RAS hardware extensions and hypervisor changes, the checkpointing replayer has an execution speed comparable to the recorded execution. Also, the alarm replayer needs to handle very few false positives.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"15 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113960663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
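As a rough illustration of the alarm-filtering idea described in this abstract, the sketch below pairs a deliberately imprecise hardware-style detector with a precise check run over a replayed log. The event format, the tiny return-address-stack depth, and the shadow-stack checker are hypothetical stand-ins for illustration only, not RnR-Safe's actual mechanisms.

```python
# Toy illustration of the RnR-Safe idea: a cheap, imprecise hardware detector
# raises alarms, and a deterministic replay from the recorded log runs a
# precise checker to separate real attacks from false positives.
# Everything below (event format, detector depth, checker) is a made-up stand-in.

from collections import deque

HW_RAS_DEPTH = 4          # hypothetical: tiny hardware return-address stack

def imprecise_hw_detector(events):
    """Flags suspected ROP: a return address that does not match the hardware
    RAS top. Overflows of the small RAS cause false positives."""
    ras, alarms = deque(maxlen=HW_RAS_DEPTH), []
    for i, (kind, addr) in enumerate(events):
        if kind == "call":
            ras.append(addr + 1)              # expected return site
        elif kind == "ret":
            expected = ras.pop() if ras else None
            if expected != addr:
                alarms.append(i)              # may be a real attack OR RAS overflow
    return alarms

def precise_replay_checker(events, upto):
    """Replays the recorded log with an unbounded shadow stack (no overflow),
    so a mismatch here is treated as a genuine ROP violation."""
    shadow = []
    for kind, addr in events[:upto + 1]:
        if kind == "call":
            shadow.append(addr + 1)
        elif kind == "ret":
            if not shadow or shadow.pop() != addr:
                return "real attack"
    return "false positive"

# Recorded execution: deep nesting overflows the 4-entry hardware RAS,
# and the final return goes to a gadget address (a real violation).
log = [("call", a) for a in (10, 20, 30, 40, 50, 60)]
log += [("ret", a + 1) for a in (60, 50, 40, 30, 20)]   # correct unwinding
log += [("ret", 999)]                                   # corrupted return -> gadget

for alarm in imprecise_hw_detector(log):
    print(f"alarm at event {alarm}: {precise_replay_checker(log, alarm)}")
```

Running this prints one alarm classified as a false positive (caused by RAS overflow) and one classified as a real attack, which is the division of labor the abstract describes.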
LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs
A. Arunkumar, Shin-Ying Lee, Vignesh Soundararajan, Carole-Jean Wu
General-purpose GPU applications are significantly constrained by the efficiency of the memory subsystem and the availability of data cache capacity on GPUs. Cache compression, while able to expand the effective cache capacity and improve cache efficiency, comes at the cost of increased hit latency. This has confined cache compression mostly to lower-level caches, leaving it unexplored for L1 caches and for GPUs. Directly applying state-of-the-art high performance cache compression schemes on GPUs results in a wide performance variation from -52% to 48%. To maximize the performance and energy benefits of cache compression for GPUs, we propose a new compression management scheme, called LATTE-CC. LATTE-CC is designed to exploit the dynamically-varying latency tolerance feature of GPUs. LATTE-CC compresses cache lines based on its prediction of the degree of latency tolerance of GPU streaming multiprocessors and by choosing between three distinct compression modes: no compression, low-latency, and high-capacity. LATTE-CC improves the performance of cache sensitive GPGPU applications by as much as 48.4% and by an average of 19.2%, outperforming the static application of compression algorithms. LATTE-CC also reduces GPU energy consumption by an average of 10%, which is twice as much as that of the state-of-the-art compression scheme.
{"title":"LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs","authors":"A. Arunkumar, Shin-Ying Lee, Vignesh Soundararajan, Carole-Jean Wu","doi":"10.1109/HPCA.2018.00028","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00028","url":null,"abstract":"General-purpose GPU applications are significantly constrained by the efficiency of the memory subsystem and the availability of data cache capacity on GPUs. Cache compression, while is able to expand the effective cache capacity and improve cache efficiency, comes with the cost of increased hit latency. This has constrained the application of cache compression to mostly lower level caches, leaving it unexplored for L1 caches and for GPUs. Directly applying state-of-the-art high performance cache compression schemes on GPUs results in a wide performance variation from -52% to 48%. To maximize the performance and energy benefits of cache compression for GPUs, we propose a new compression management scheme, called LATTE-CC. LATTE-CC is designed to exploit the dynamically-varying latency tolerance feature of GPUs. LATTE-CC compresses cache lines based on its prediction of the degree of latency tolerance of GPU streaming multiprocessors and by choosing between three distinct compression modes: no compression, low-latency, and high-capacity. LATTE-CC improves the performance of cache sensitive GPGPU applications by as much as 48.4% and by an average of 19.2%, outperforming the static application of compression algorithms. LATTE-CC also reduces GPU energy consumption by an average of 10%, which is twice as much as that of the state-of-the-art compression scheme.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115673131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
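The abstract's core decision is choosing one of three compression modes per epoch from a prediction of how latency-tolerant the SM currently is. The sketch below makes that selection step concrete; the tolerance proxy (ready vs. stalled warps) and the thresholds are assumptions for illustration, not LATTE-CC's actual predictor.

```python
# Sketch of LATTE-CC's mode-selection idea: pick a cache-compression mode from
# a prediction of the SM's current latency tolerance.

NO_COMPRESSION, LOW_LATENCY, HIGH_CAPACITY = "none", "low-latency", "high-capacity"

def predict_latency_tolerance(ready_warps, stalled_warps):
    """Crude proxy: the more warps that are ready to issue, the easier it is to
    hide extra decompression latency behind other work."""
    total = ready_warps + stalled_warps
    return ready_warps / total if total else 0.0

def choose_compression_mode(ready_warps, stalled_warps,
                            lo_threshold=0.25, hi_threshold=0.60):
    tolerance = predict_latency_tolerance(ready_warps, stalled_warps)
    if tolerance >= hi_threshold:
        # Plenty of latency tolerance: pay higher decompression latency to
        # squeeze the most lines into the cache.
        return HIGH_CAPACITY
    if tolerance >= lo_threshold:
        # Some tolerance: use a cheaper, lower-ratio compression scheme.
        return LOW_LATENCY
    # Latency-critical phase: store lines uncompressed to keep hit latency low.
    return NO_COMPRESSION

for ready, stalled in [(30, 2), (12, 20), (2, 30)]:
    print(ready, stalled, "->", choose_compression_mode(ready, stalled))
```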
KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores
Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, H. Kasture, Xiaosong Ma, Daniel Sánchez
Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.
{"title":"KPart: A Hybrid Cache Partitioning-Sharing Technique for Commodity Multicores","authors":"Nosayba El-Sayed, Anurag Mukkara, Po-An Tsai, H. Kasture, Xiaosong Ma, Daniel Sánchez","doi":"10.1109/HPCA.2018.00019","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00019","url":null,"abstract":"Cache partitioning is now available in commercial hardware. In theory, software can leverage cache partitioning to use the last-level cache better and improve performance. In practice, however, current systems implement way-partitioning, which offers a limited number of partitions and often hurts performance. These limitations squander the performance potential of smart cache management. We present KPart, a hybrid cache partitioning-sharing technique that sidesteps the limitations of way-partitioning and unlocks significant performance on current systems. KPart first groups applications into clusters, then partitions the cache among these clusters. To build clusters, KPart relies on a novel technique to estimate the performance loss an application suffers when sharing a partition. KPart automatically chooses the number of clusters, balancing the isolation benefits of way-partitioning with its potential performance impact. KPart uses detailed profiling information to make these decisions. This information can be gathered either offline, or online at low overhead using a novel profiling mechanism. We evaluate KPart in a real system and in simulation. KPart improves throughput by 24% on average (up to 79%) on an Intel Broadwell-D system, whereas prior per-application partitioning policies improve throughput by just 1.7% on average and hurt 30% of workloads. Simulation results show that KPart achieves most of the performance of more advanced partitioning techniques that are not yet available in hardware.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129464714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 93
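A simplified rendering of the two steps the abstract describes — cluster applications by estimated sharing slowdown, then way-partition the last-level cache among clusters — is sketched below. The slowdown matrix, miss-per-way curves, threshold, and greedy allocation are invented inputs and a deliberately naive policy; KPart's real profiling and clustering are more involved.

```python
# Simplified sketch of the KPart flow: (1) group applications into clusters
# using estimated pairwise sharing slowdowns, (2) allocate cache ways to the
# clusters by greedy marginal miss reduction.

def cluster_apps(slowdown, threshold=0.05):
    """Greedily merge apps whose estimated mutual sharing slowdown is below
    the threshold; others stay isolated."""
    clusters = []
    for app in sorted(slowdown):
        for c in clusters:
            if all(slowdown[app][other] < threshold and
                   slowdown[other][app] < threshold for other in c):
                c.append(app)
                break
        else:
            clusters.append([app])
    return clusters

def partition_ways(clusters, miss_curve, total_ways=12, min_ways=1):
    """Give every cluster min_ways, then hand out remaining ways one at a time
    to the cluster with the largest miss reduction for one more way."""
    alloc = {i: min_ways for i in range(len(clusters))}
    def misses(i, ways):           # cluster miss count = sum of member curves
        return sum(miss_curve[a][min(ways, len(miss_curve[a]) - 1)]
                   for a in clusters[i])
    for _ in range(total_ways - min_ways * len(clusters)):
        best = max(alloc, key=lambda i: misses(i, alloc[i]) - misses(i, alloc[i] + 1))
        alloc[best] += 1
    return {tuple(clusters[i]): w for i, w in alloc.items()}

# Hypothetical inputs: pairwise sharing slowdowns and misses-per-way curves.
apps = ["A", "B", "C"]
slowdown = {"A": {"B": 0.02, "C": 0.30}, "B": {"A": 0.01, "C": 0.25},
            "C": {"A": 0.20, "B": 0.22}}
miss_curve = {a: [100 - 7 * w if a != "C" else 100 - 2 * w for w in range(13)]
              for a in apps}

clusters = cluster_apps(slowdown)          # -> [['A', 'B'], ['C']]
print(partition_ways(clusters, miss_curve))
```

Here A and B barely hurt each other, so they share one partition, while the cache-insensitive C is isolated with a single way — the kind of hybrid partition-plus-share outcome the abstract argues for.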
RCoal: Mitigating GPU Timing Attack via Subwarp-Based Randomized Coalescing Techniques
Gurunath Kadam, Danfeng Zhang, Adwait Jog
Graphics processing units (GPUs) are becoming default accelerators in many domains such as high-performance computing (HPC), deep learning, and virtual/augmented reality. Recently, GPUs have also shown significant speedups for a variety of security-sensitive applications such as encryptions. These speedups have largely benefited from the high memory bandwidth and compute throughput of GPUs. One of the key features to optimize the memory bandwidth consumption in GPUs is intra-warp memory access coalescing, which merges memory requests originating from different threads of a single warp into as few cache lines as possible. However, this coalescing feature is also shown to make the GPUs prone to the correlation timing attacks as it exposes the relationship between the execution time and the number of coalesced accesses. Consequently, an attacker is able to correctly reveal an AES private key via repeatedly gathering encrypted data and execution time on a GPU. In this work, we propose a series of defense mechanisms to alleviate such timing attacks by carefully trading off performance for improved security. Specifically, we propose to randomize the coalescing logic such that the attacker finds it hard to guess the correct number of coalesced accesses generated. To this end, we propose to randomize: a) the granularity (called as subwarp) at which warp threads are grouped together for coalescing, and b) the threads selected by each subwarp for coalescing. Such randomization techniques result in three mechanisms: fixed-sized subwarp (FSS), random-sized subwarp (RSS), and random-threaded subwarp (RTS). We find that the combination of these security mechanisms offers 24- to 961-times improvement in the security against the correlation timing attacks with 5 to 28% performance degradation.
{"title":"RCoal: Mitigating GPU Timing Attack via Subwarp-Based Randomized Coalescing Techniques","authors":"Gurunath Kadam, Danfeng Zhang, Adwait Jog","doi":"10.1109/HPCA.2018.00023","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00023","url":null,"abstract":"Graphics processing units (GPUs) are becoming default accelerators in many domains such as high-performance computing (HPC), deep learning, and virtual/augmented reality. Recently, GPUs have also shown significant speedups for a variety of security-sensitive applications such as encryptions. These speedups have largely benefited from the high memory bandwidth and compute throughput of GPUs. One of the key features to optimize the memory bandwidth consumption in GPUs is intra-warp memory access coalescing, which merges memory requests originating from different threads of a single warp into as few cache lines as possible. However, this coalescing feature is also shown to make the GPUs prone to the correlation timing attacks as it exposes the relationship between the execution time and the number of coalesced accesses. Consequently, an attacker is able to correctly reveal an AES private key via repeatedly gathering encrypted data and execution time on a GPU. In this work, we propose a series of defense mechanisms to alleviate such timing attacks by carefully trading off performance for improved security. Specifically, we propose to randomize the coalescing logic such that the attacker finds it hard to guess the correct number of coalesced accesses generated. To this end, we propose to randomize: a) the granularity (called as subwarp) at which warp threads are grouped together for coalescing, and b) the threads selected by each subwarp for coalescing. Such randomization techniques result in three mechanisms: fixed-sized subwarp (FSS), random-sized subwarp (RSS), and random-threaded subwarp (RTS). We find that the combination of these security mechanisms offers 24- to 961-times improvement in the security against the correlation timing attacks with 5 to 28% performance degradation.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131756860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
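The three randomization schemes named in the abstract can be illustrated by counting cache-line transactions for one warp's memory access. The sketch below compares baseline intra-warp coalescing against fixed-sized, random-sized, and random-threaded subwarps; the warp size, line size, and the way randomness is drawn are simplifications, not the hardware design.

```python
# Sketch of subwarp-based randomized coalescing. Baseline coalescing merges all
# 32 thread addresses of a warp into unique cache-line transactions; the three
# RCoal-style variants split the warp into subwarps and coalesce only within
# each subwarp, trading extra transactions for a noisier timing signal.

import random

WARP_SIZE, LINE_BYTES = 32, 128

def coalesce(addresses):
    """Number of cache-line transactions for one group of thread addresses."""
    return len({addr // LINE_BYTES for addr in addresses})

def fixed_sized_subwarps(addresses, subwarp_size=8):
    return [addresses[i:i + subwarp_size]
            for i in range(0, len(addresses), subwarp_size)]

def random_sized_subwarps(addresses, rng):
    groups, i = [], 0
    while i < len(addresses):
        size = rng.choice([4, 8, 16])
        groups.append(addresses[i:i + size])
        i += size
    return groups

def random_threaded_subwarps(addresses, rng, subwarp_size=8):
    shuffled = addresses[:]
    rng.shuffle(shuffled)                  # subwarp membership is randomized
    return fixed_sized_subwarps(shuffled, subwarp_size)

rng = random.Random(42)
addrs = [i * 4 for i in range(WARP_SIZE)]  # fully coalescable: 32 x 4-byte loads

print("baseline:", coalesce(addrs))        # 1 transaction
for name, groups in [("FSS", fixed_sized_subwarps(addrs)),
                     ("RSS", random_sized_subwarps(addrs, rng)),
                     ("RTS", random_threaded_subwarps(addrs, rng))]:
    print(name, sum(coalesce(g) for g in groups))
```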
OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator
S. Pal, Jonathan Beaumont, Dong-hyeon Park, Aporva Amarnath, Siying Feng, C. Chakrabarti, Hun-Seok Kim, D. Blaauw, T. Mudge, R. Dreslinski
Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm2.
{"title":"OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator","authors":"S. Pal, Jonathan Beaumont, Dong-hyeon Park, Aporva Amarnath, Siying Feng, C. Chakrabarti, Hun-Seok Kim, D. Blaauw, T. Mudge, R. Dreslinski","doi":"10.1109/HPCA.2018.00067","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00067","url":null,"abstract":"Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm2.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126966140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 179
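The outer-product formulation the abstract refers to decouples multiplication from accumulation: for each k, column k of A is multiplied against row k of B with no index matching, and a later merge phase accumulates the partial products into C. The sketch below shows that dataflow on a tiny example; plain Python dicts stand in for the CSC/CSR layouts an accelerator would use.

```python
# Outer-product sparse matrix-matrix multiplication: a multiply phase streams
# column k of A against row k of B, and a separate merge phase accumulates the
# resulting partial products into C.

from collections import defaultdict

# A in column-major sparse form: a_cols[k] = [(row i, value), ...]
a_cols = {0: [(0, 2.0), (2, 1.0)], 1: [(1, 3.0)], 2: [(0, 4.0), (2, 5.0)]}
# B in row-major sparse form: b_rows[k] = [(col j, value), ...]
b_rows = {0: [(0, 1.0), (2, 6.0)], 1: [(1, 7.0)], 2: [(2, 8.0)]}

# Multiply phase: every (i, k) x (k, j) pair yields a partial product (i, j, v).
partial_products = []
for k in set(a_cols) & set(b_rows):
    for i, a_val in a_cols[k]:
        for j, b_val in b_rows[k]:
            partial_products.append((i, j, a_val * b_val))

# Merge phase: accumulate partial products that target the same C(i, j).
c = defaultdict(float)
for i, j, v in partial_products:
    c[(i, j)] += v

print(sorted(c.items()))
# [((0, 0), 2.0), ((0, 2), 44.0), ((1, 1), 21.0), ((2, 0), 1.0), ((2, 2), 46.0)]
```

The redundant-read problem the paper targets is visible even here: the multiply phase never re-reads a non-zero of A or B, and all reuse is deferred to the accumulation of partial products.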
WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs
Keunsoo Kim, W. Ro
Warp instructions that perform an identical arithmetic operation on the same input values produce identical computation results. This paper proposes warp instruction reuse to allow such repeated warp instructions to reuse previous computation results instead of actually executing the instructions. Bypassing the register read, functional unit, and register write operations improves energy efficiency. This reuse technique is especially beneficial for GPUs since a GPU warp register is usually as wide as thousands of bits. In addition, we propose warp register reuse, which allows identical warp register values to share a single physical register through register renaming. The register reuse technique makes it possible to tell whether different logical warp registers hold an identical value by looking only at their physical warp register IDs. Based on this observation, warp register reuse helps perform all operations needed for warp instruction reuse using register IDs, which is substantially more efficient than directly manipulating register values. Performance evaluation shows that 20.5% SM energy and 10.7% GPU energy can be saved by allowing 18.7% of warp instructions to reuse prior results.
{"title":"WIR: Warp Instruction Reuse to Minimize Repeated Computations in GPUs","authors":"Keunsoo Kim, W. Ro","doi":"10.1109/HPCA.2018.00041","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00041","url":null,"abstract":"Warp instructions with an identical arithmetic operation on same input values produce the identical computation results. This paper proposes warp instruction reuse to allow such repeated warp instructions to reuse previous computation results instead of actually executing the instructions. Bypassing register reading, functional unit, and register writing operations improves energy efficiency. This reuse technique is especially beneficial for GPUs since a GPU warp register is usually as wide as thousands of bits. In addition, we propose warp register reuse which allows identical warp register values to share a single physical register through register renaming. The register reuse technique enables to see if different logical warp registers have an identical value by only looking at their physical warp register IDs. Based on this observation, warp register reuse helps to perform all necessary operations for warp instruction reuse with register IDs, which is substantially more efficient than directly manipulating register values. Performance evaluation shows that 20.5% SM energy and 10.7% GPU energy can be saved by allowing 18.7% of warp instructions to reuse prior results.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128792266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
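At its core this is memoization of warp-wide computations: a result table keyed by the operation and its warp-wide inputs lets a repeated warp instruction skip execution. The sketch below keys the table on operand values directly for clarity; the paper's register-renaming refinement (comparing small physical register IDs instead of thousand-bit values) is only noted in a comment, and the table organization here is an illustrative assumption.

```python
# Toy warp-instruction reuse table: if a warp instruction with the same opcode
# and the same warp-wide input values has executed before, return the cached
# result vector instead of recomputing it. WIR additionally renames identical
# warp register values onto one physical register so this lookup can compare
# register IDs rather than full values; that part is omitted here.

WARP_SIZE = 32

class WarpReuseTable:
    def __init__(self):
        self.table = {}           # (opcode, operands) -> result vector
        self.hits = self.misses = 0

    def execute(self, opcode, src_a, src_b):
        key = (opcode, tuple(src_a), tuple(src_b))
        if key in self.table:
            self.hits += 1        # reuse: skip register read + ALU + writeback
            return self.table[key]
        self.misses += 1
        op = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}[opcode]
        result = [op(a, b) for a, b in zip(src_a, src_b)]
        self.table[key] = result
        return result

wrt = WarpReuseTable()
base = list(range(WARP_SIZE))
wrt.execute("add", base, [1] * WARP_SIZE)   # miss: computed and cached
wrt.execute("add", base, [1] * WARP_SIZE)   # hit: identical warp instruction
wrt.execute("mul", base, [1] * WARP_SIZE)   # miss: different opcode
print(f"hits={wrt.hits} misses={wrt.misses}")   # hits=1 misses=2
```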
Secure DIMM: Moving ORAM Primitives Closer to Memory
Ali Shafiee, R. Balasubramonian, Mohit Tiwari, Feifei Li
As more critical applications move to the cloud, there is a pressing need to provide privacy guarantees for data and computation. While cloud infrastructures are vulnerable to a variety of attacks, in this work, we focus on an attack model where an untrusted cloud operator has physical access to the server and can monitor the signals emerging from the processor socket. Even if data packets are encrypted, the sequence of addresses touched by the program serves as an information side channel. To eliminate this side channel, Oblivious RAM constructs have been investigated for decades, but continue to pose large overheads. In this work, we make the case that ORAM overheads can be significantly reduced by moving some ORAM functionality into the memory system. We first design a secure DIMM (or SDIMM) that uses commodity low-cost memory and an ASIC as a secure buffer chip. We then design two new ORAM protocols that leverage SDIMMs to reduce bandwidth, latency, and energy per ORAM access. In both protocols, each SDIMM is responsible for part of the ORAM tree. Each SDIMM performs a number of ORAM operations that are not visible to the main memory channel. By having many SDIMMs in the system, we are able to achieve highly parallel ORAM operations. The main memory channel uses its bandwidth primarily to service blocks requested by the CPU, and to perform a small subset of the many shuffle operations required by conventional ORAM. The new protocols guarantee the same obliviousness properties as Path ORAM. On a set of memory-intensive workloads, our two new ORAM protocols – Independent ORAM and Split ORAM – are able to improve performance by 1.9x and energy by 2.55x, compared to Freecursive ORAM.
{"title":"Secure DIMM: Moving ORAM Primitives Closer to Memory","authors":"Ali Shafiee, R. Balasubramonian, Mohit Tiwari, Feifei Li","doi":"10.1109/HPCA.2018.00044","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00044","url":null,"abstract":"As more critical applications move to the cloud, there is a pressing need to provide privacy guarantees for data and computation. While cloud infrastructures are vulnerable to a variety of attacks, in this work, we focus on an attack model where an untrusted cloud operator has physical access to the server and can monitor the signals emerging from the processor socket. Even if data packets are encrypted, the sequence of addresses touched by the program serves as an information side channel. To eliminate this side channel, Oblivious RAM constructs have been investigated for decades, but continue to pose large overheads. In this work, we make the case that ORAM overheads can be significantly reduced by moving some ORAM functionality into the memory system. We first design a secure DIMM (or SDIMM) that uses commodity low-cost memory and an ASIC as a secure buffer chip. We then design two new ORAM protocols that leverage SDIMMs to reduce bandwidth, latency, and energy per ORAM access. In both protocols, each SDIMM is responsible for part of the ORAM tree. Each SDIMM performs a number of ORAM operations that are not visible to the main memory channel. By having many SDIMMs in the system, we are able to achieve highly parallel ORAM operations. The main memory channel uses its bandwidth primarily to service blocks requested by the CPU, and to perform a small subset of the many shuffle operations required by conventional ORAM. The new protocols guarantee the same obliviousness properties as Path ORAM. On a set of memory-intensive workloads, our two new ORAM protocols – Independent ORAM and Split ORAM – are able to improve performance by 1.9x and energy by 2.55x, compared to Freecursive ORAM.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123159473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
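To make the "each SDIMM is responsible for part of the ORAM tree" idea concrete, the back-of-the-envelope sketch below splits a Path-ORAM-style bucket tree across several SDIMMs and counts how many bucket touches on one root-to-leaf path cross the shared memory channel versus staying inside a single SDIMM. The partitioning scheme, tree size, and counting are illustrative assumptions; the paper's Independent and Split ORAM protocols differ in detail.

```python
# Plausible-but-hypothetical partitioning: give each of K SDIMMs one subtree
# rooted at level log2(K). On a root-to-leaf access, only the few buckets
# above that level need the shared memory channel; the rest can be handled
# inside the SDIMM that owns the path's subtree.

import math

def path_access_traffic(tree_levels, num_sdimms, leaf):
    split_level = int(math.log2(num_sdimms))         # levels 0..split_level-1 shared
    channel_buckets = split_level                    # served over the memory channel
    local_buckets = tree_levels - split_level        # served inside one SDIMM
    owner = leaf >> (tree_levels - 1 - split_level)  # which SDIMM owns this path
    return owner, channel_buckets, local_buckets

LEVELS, SDIMMS = 24, 8                               # 2^23 leaves, 8 SDIMMs
owner, on_channel, local = path_access_traffic(LEVELS, SDIMMS, leaf=0b1011 << 19)
print(f"path owned by SDIMM {owner}: "
      f"{on_channel} bucket reads on the channel, {local} handled locally")
```

Under these assumptions only 3 of the 24 buckets on a path consume channel bandwidth, which is the kind of traffic reduction the abstract attributes to moving ORAM work onto the SDIMMs.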
A Case for Packageless Processors
Saptadeep Pal, Daniel Petrisko, A. Bajwa, Puneet Gupta, S. Iyer, Rakesh Kumar
Demand for increasing performance is far outpacing the capability of traditional methods for performance scaling. Disruptive solutions are needed to advance beyond incremental improvements. Traditionally, processors reside inside packages to enable PCB-based integration. We argue that packages reduce the potential memory bandwidth of a processor by at least one order of magnitude, allowable thermal design power (TDP) by up to 70%, and area efficiency by a factor of 5 to 18. Further, silicon chips have scaled well while packages have not. We propose packageless processors - processors where packages have been removed and dies directly mounted on a silicon board using a novel integration technology, Silicon Interconnection Fabric (Si-IF). We show that Si-IF-based packageless processors outperform their packaged counterparts by up to 58% (16% average), 136%(103% average), and 295% (80% average) due to increased memory bandwidth, increased allowable TDP, and reduced area respectively. We also extend the concept of packageless processing to the entire processor and memory system, where the area footprint reduction was up to 76%.
{"title":"A Case for Packageless Processors","authors":"Saptadeep Pal, Daniel Petrisko, A. Bajwa, Puneet Gupta, S. Iyer, Rakesh Kumar","doi":"10.1109/HPCA.2018.00047","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00047","url":null,"abstract":"Demand for increasing performance is far outpacing the capability of traditional methods for performance scaling. Disruptive solutions are needed to advance beyond incremental improvements. Traditionally, processors reside inside packages to enable PCB-based integration. We argue that packages reduce the potential memory bandwidth of a processor by at least one order of magnitude, allowable thermal design power (TDP) by up to 70%, and area efficiency by a factor of 5 to 18. Further, silicon chips have scaled well while packages have not. We propose packageless processors - processors where packages have been removed and dies directly mounted on a silicon board using a novel integration technology, Silicon Interconnection Fabric (Si-IF). We show that Si-IF-based packageless processors outperform their packaged counterparts by up to 58% (16% average), 136%(103% average), and 295% (80% average) due to increased memory bandwidth, increased allowable TDP, and reduced area respectively. We also extend the concept of packageless processing to the entire processor and memory system, where the area footprint reduction was up to 76%.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126787180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Crash Consistency in Encrypted Non-volatile Main Memory Systems
Sihang Liu, Aasheesh Kolli, Jinglei Ren, S. Khan
Non-Volatile Main Memory (NVMM) systems provide high performance by directly manipulating persistent data in-memory, but require crash consistency support to recover data in a consistent state in case of a power failure or system crash. In this work, we focus on the interplay between the crash consistency mechanisms and memory encryption. Memory encryption is necessary for these systems to protect data against the attackers with physical access to the persistent main memory. As decrypting data at every memory read access can significantly degrade the performance, prior works propose to use a memory encryption technique, counter-mode encryption, that reduces the decryption overhead by performing a memory read access in parallel with the decryption process using a counter associated with each cache line. Therefore, a pair of data and counter value is needed to correctly decrypt data after a system crash. We demonstrate that counter-mode encryption does not readily extend to crash consistent NVMM systems as the system will fail to recover data in a consistent state if the encrypted data and associated counter are not written back to memory atomically, a requirement we refer to as counter-atomicity. We show that naïvely enforcing counter-atomicity for all NVMM writes can serialize memory accesses and results in a significant performance degradation. In order to improve the performance, we make an observation that not all writes to NVMM need to be counter-atomic. The crash consistency mechanisms rely on versioning to keep one consistent copy of data intact while manipulating another version directly in-memory. As the recovery process only relies on the unmodified consistent version, it is not necessary to strictly enforce counter-atomicity for the writes that do not affect data recovery. Based on this insight, we propose selective counter-atomicity that allows reordering of writes to data and associated counters when the writes to persistent memory do not alter the recoverable consistent state. We propose efficient software and hardware support to enforce selective counter-atomicity. Our evaluation demonstrates that in a 1/2/4/8-core system, selective counter-atomicity improves performance by 6/11/22/40% compared to a system that enforces counter-atomicity for all NVMM writes. The performance of our selective counter-atomicity design comes within 5% of an ideal NVMM system that provides crash consistency of encrypted data at no cost.
{"title":"Crash Consistency in Encrypted Non-volatile Main Memory Systems","authors":"Sihang Liu, Aasheesh Kolli, Jinglei Ren, S. Khan","doi":"10.1109/HPCA.2018.00035","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00035","url":null,"abstract":"Non-Volatile Main Memory (NVMM) systems provide high performance by directly manipulating persistent data in-memory, but require crash consistency support to recover data in a consistent state in case of a power failure or system crash. In this work, we focus on the interplay between the crash consistency mechanisms and memory encryption. Memory encryption is necessary for these systems to protect data against the attackers with physical access to the persistent main memory. As decrypting data at every memory read access can significantly degrade the performance, prior works propose to use a memory encryption technique, counter-mode encryption, that reduces the decryption overhead by performing a memory read access in parallel with the decryption process using a counter associated with each cache line. Therefore, a pair of data and counter value is needed to correctly decrypt data after a system crash. We demonstrate that counter-mode encryption does not readily extend to crash consistent NVMM systems as the system will fail to recover data in a consistent state if the encrypted data and associated counter are not written back to memory atomically, a requirement we refer to as counter-atomicity. We show that na¨ıvely enforcing counter-atomicity for all NVMM writes can serialize memory accesses and results in a significant performance degradation. In order to improve the performance, we make an observation that not all writes to NVMM need to be counter-atomic. The crash consistency mechanisms rely on versioning to keep one consistent copy of data intact while manipulating another version directly in-memory. As the recovery process only relies on the unmodified consistent version, it is not necessary to strictly enforce counter-atomicity for the writes that do not affect data recovery. Based on this insight, we propose selective counter-atomicity that allows reordering of writes to data and associated counters when the writes to persistent memory do not alter the recoverable consistent state. We propose efficient software and hardware support to enforce selective counter-atomicity. Our evaluation demonstrates that in a 1/2/4/8- core system, selective counter-atomicity improves performance by 6/11/22/40% compared to a system that enforces counter-atomicity for all NVMM writes. The performance of our selective counter-atomicity design comes within 5% of an ideal NVMM system that provides crash consistency of encrypted data at no cost.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"188 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114209526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 73
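The counter-atomicity requirement follows directly from how counter-mode encryption works: the ciphertext is the plaintext XORed with a keystream derived from the key, the line address, and a per-line counter, so a line can only be decrypted with exactly the counter used to encrypt it. The sketch below demonstrates what recovery sees when a crash persists the new ciphertext but not the bumped counter. SHA-256 is a stand-in for the AES-based keystream a real memory controller would generate, and the layout is purely illustrative.

```python
# Why counter-atomicity matters for encrypted NVMM: decrypting with a stale
# counter yields garbage, so the data line and its counter must reach
# persistence atomically (or the crash-consistency mechanism must tolerate it).

import hashlib

KEY = b"secret-memory-encryption-key"

def keystream(line_addr, counter, length):
    return hashlib.sha256(KEY + line_addr.to_bytes(8, "little")
                          + counter.to_bytes(8, "little")).digest()[:length]

def encrypt(plaintext, line_addr, counter):
    return bytes(p ^ k for p, k in
                 zip(plaintext, keystream(line_addr, counter, len(plaintext))))

decrypt = encrypt        # XOR with the same keystream inverts the operation

line_addr, old_ctr = 0x1000, 7
nvmm_data = encrypt(b"balance=100", line_addr, old_ctr)
nvmm_ctr = old_ctr

# Update the line: bump the counter, re-encrypt, then crash after persisting
# the data but BEFORE persisting the counter (no counter-atomicity).
new_ctr = old_ctr + 1
nvmm_data = encrypt(b"balance=950", line_addr, new_ctr)
# ... crash here: nvmm_ctr still holds old_ctr ...

print(decrypt(nvmm_data, line_addr, nvmm_ctr))    # garbage: stale counter
print(decrypt(nvmm_data, line_addr, new_ctr))     # b'balance=950' if both persisted
```

Selective counter-atomicity, as described in the abstract, relaxes this pairing requirement only for writes whose loss cannot affect the recoverable consistent version.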
SIPT: Speculatively Indexed, Physically Tagged Caches
Tianhao Zheng, Haishan Zhu, M. Erez
First-level (L1) data cache access latency is critical to performance because it services the vast majority of loads and stores. To keep L1 latency low while ensuring low-complexity and simple-to-verify operation, current processors most-typically utilize a virtually-indexed physically-tagged (VIPT) cache architecture. While VIPT caches decrease latency by proceeding with cache access and address translation concurrently, each cache way is constrained by the size of a virtual page. Thus, larger L1 caches are highly-associative, which degrades their access latency and energy. We propose speculatively-indexed physically-tagged (SIPT) caches to enable simultaneously larger, faster, and more efficient L1 caches. A SIPT cache speculates on the value of a few address bits beyond the page offset concurrently with address translation, maintaining the overall safe and reliable architecture of a VIPT cache while eliminating the VIPT design constraints. SIPT is a purely microarchitectural approach that can be used with any software and for all accesses. We evaluate SIPT with simulations of applications under standard Linux. SIPT improves performance by 8.1% on average and reduces total cache-hierarchy energy by 15.6%.
{"title":"SIPT: Speculatively Indexed, Physically Tagged Caches","authors":"Tianhao Zheng, Haishan Zhu, M. Erez","doi":"10.1109/HPCA.2018.00020","DOIUrl":"https://doi.org/10.1109/HPCA.2018.00020","url":null,"abstract":"First-level (L1) data cache access latency is critical to performance because it services the vast majority of loads and stores. To keep L1 latency low while ensuring low-complexity and simple-to-verify operation, current processors most-typically utilize a virtually-indexed physically-tagged (VIPT) cache architecture. While VIPT caches decrease latency by proceeding with cache access and address translation concurrently, each cache way is constrained by the size of a virtual page. Thus, larger L1 caches are highly-associative, which degrades their access latency and energy. We propose speculatively-indexed physically-tagged (SIPT) caches to enable simultaneously larger, faster, and more efficient L1 caches. A SIPT cache speculates on the value of a few address bits beyond the page offset concurrently with address translation, maintaining the overall safe and reliable architecture of a VIPT cache while eliminating the VIPT design constraints. SIPT is a purely microarchitectural approach that can be used with any software and for all accesses. We evaluate SIPT with simulations of applications under standard Linux. SIPT improves performance by 8.1% on average and reduces total cache-hierarchy energy by 15.6%.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115222964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
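The bit arithmetic behind the idea: in a VIPT cache the set index must fit inside the page offset, which caps each way at the page size; SIPT instead guesses the few index bits that lie above the page offset, indexes the cache in parallel with translation, and verifies the guess against the translated physical address. The sketch below works through those bits for a hypothetical 64 KB, 4-way, 64 B-line cache with 4 KB pages (two index bits to speculate); the geometry and the trivial "guess the virtual bits" policy are illustrative choices, not the paper's predictor.

```python
# Bit-level sketch of SIPT speculation: 64 KB / 4-way / 64 B lines -> 256 sets,
# index bits 6..13; with 4 KB pages (offset bits 0..11), bits 12 and 13 must be
# predicted before translation and verified afterwards.

LINE_BITS, SET_BITS, PAGE_OFFSET_BITS = 6, 8, 12
SPEC_BITS = LINE_BITS + SET_BITS - PAGE_OFFSET_BITS      # = 2 bits to speculate

def set_index(addr):
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

def speculative_index(vaddr, predicted_bits):
    """Index built from page-offset bits (always correct) plus predicted bits."""
    low = set_index(vaddr) & ((1 << (SET_BITS - SPEC_BITS)) - 1)
    return (predicted_bits << (SET_BITS - SPEC_BITS)) | low

def verify(paddr, predicted_bits):
    actual = (paddr >> PAGE_OFFSET_BITS) & ((1 << SPEC_BITS) - 1)
    return predicted_bits == actual      # mismatch -> replay with correct index

vaddr = 0x0040_3A40
paddr = 0x0012_6A40                      # same page offset, different frame
prediction = (vaddr >> PAGE_OFFSET_BITS) & ((1 << SPEC_BITS) - 1)  # guess VA bits

print("speculative set:", speculative_index(vaddr, prediction))
print("prediction correct:", verify(paddr, prediction))
print("true physical set:", set_index(paddr))
```

In this example the guessed bits differ from the physical ones, so the access would be replayed with the correct index; the paper's results suggest such mispredictions are rare enough for the larger, lower-associativity cache to pay off.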