Data locality has a profound impact on program performance. Reuse distance—the number of distinct memory locations accessed between two consecutive accesses to the same location—is the de facto, machine-independent metric of data locality in a program. Reuse distance measurement typically requires exhaustive instrumentation (code or binary) to log every memory access, which results in orders-of-magnitude runtime slowdown and memory bloat. Such high overheads impede the adoption of reuse distance tools in long-running production applications despite their usefulness. We develop RDX, a lightweight profiling tool for characterizing reuse distance in an execution; RDX typically incurs negligible time (5%) and memory (7%) overheads. RDX performs no instrumentation whatsoever but uniquely combines hardware performance counter sampling with hardware debug registers, both available in commodity processors, to produce reuse-distance histograms. RDX typically achieves more than 90% accuracy compared to the ground truth. With the help of RDX, we are the first to characterize the memory performance of long-running SPEC CPU2017 benchmarks. Keywords-Reuse distance; locality; hardware performance counters; debug registers; profiling.
{"title":"Featherlight Reuse-Distance Measurement","authors":"Qingsen Wang, Xu Liu, Milind Chabbi","doi":"10.1109/HPCA.2019.00056","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00056","url":null,"abstract":"Data locality has a profound impact on program performance. Reuse distance—the number of distinct memory locations accessed between two consecutive accesses to the same location—is the de facto, machine-independent metric of data locality in a program. Reuse distance measurement, typically, requires exhaustive instrumentation (code or binary) to log every memory access, which results in orders of magnitude runtime slowdown and memory bloat. Such high overheads impede reuse distance tools from adoption in long-running, production applications despite their usefulness. We develop RDX, a lightweight profiling tool for characterizing reuse distance in an execution; RDX typically incurs negligible time (5%) and memory (7%) overheads. RDX performs no instrumentation whatsoever but uniquely combines hardware performance counter sampling with hardware debug registers, both available in commodity CPU processors, to produce reuse-distance histograms. RDX typically has more than 90% accuracy compared to the ground truth. With the help of RDX, we are the first to characterize memory performance of long-running SPEC CPU2017 benchmarks. Keywords-Reuse distance; locality; hardware performance counters; debug registers; profiling.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114192747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Apostolos Kokolis, Dimitrios Skarlatos, J. Torrellas
Hybrid main memories composed of DRAM and NonVolatile Memory (NVM) combine the capacity benefits of NVM with the low-latency properties of DRAM. For highest performance, data segments should be exchanged between the two types of memories dynamically—a process known as segment swapping—based on the access patterns to the segments in the program. The key difficulty in hardware-managed swapping is to identify the appropriate segments to swap between the memories at the right time in the execution. To perform hardware-managed segment swapping both accurately and with substantial lead time, this paper proposes to use hints from the page walk in a TLB miss. We call the scheme PageSeer. During the generation of the physical address for a page in a TLB miss, the memory controller is informed. The controller uses historic data on the accesses to that page and to a subsequently-referenced page (i.e., its follower page), to potentially initiate swaps for the page and for its follower. We call these actions MMU-Triggered Prefetch Swaps. PageSeer also initiates other types of page swaps, building a complete solution for hybrid memory. Our evaluation of PageSeer with simulations of 26 workloads shows that PageSeer effectively hides the swap overhead and services many requests from the DRAM. Compared to a state-of-the-art hardware-only scheme for hybrid memory management, PageSeer on average improves performance by 19% and reduces the average main memory access time by 29%. Keywords-Hybrid Memory Systems; Non-Volatile Memory; Virtual Memory; Page Walks; Page Swapping.
{"title":"PageSeer: Using Page Walks to Trigger Page Swaps in Hybrid Memory Systems","authors":"Apostolos Kokolis, Dimitrios Skarlatos, J. Torrellas","doi":"10.1109/HPCA.2019.00012","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00012","url":null,"abstract":"Hybrid main memories composed of DRAM and NonVolatile Memory (NVM) combine the capacity benefits of NVM with the low-latency properties of DRAM. For highest performance, data segments should be exchanged between the two types of memories dynamically—a process known as segment swapping—based on the access patterns to the segments in the program. The key difficulty in hardwaremanaged swapping is to identify the appropriate segments to swap between the memories at the right time in the execution. To perform hardware-managed segment swapping both accurately and with substantial lead time, this paper proposes to use hints from the page walk in a TLB miss. We call the scheme PageSeer. During the generation of the physical address for a page in a TLB miss, the memory controller is informed. The controller uses historic data on the accesses to that page and to a subsequently-referenced page (i.e., its follower page), to potentially initiate swaps for the page and for its follower. We call these actions MMU-Triggered Prefetch Swaps. PageSeer also initiates other types of page swaps, building a complete solution for hybrid memory. Our evaluation of PageSeer with simulations of 26 workloads shows that PageSeer effectively hides the swap overhead and services many requests from the DRAM. Compared to a state-of-the-art hardware-only scheme for hybrid memory management, PageSeer on average improves performance by 19% and reduces the average main memory access time by 29%. Keywords-Hybrid Memory Systems; Non-Volatile Memory; Virtual Memory; Page Walks; Page Swapping.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116636858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In order to operate reliably and produce expected outputs, modern processors set timing margins conservatively at design time to support extreme variations in workload and environment, imposing a high cost in performance and energy efficiency. The relentless pressure to improve execution bandwidth has exacerbated this problem, requiring instructions with increasingly diverse semantics, leading to datapaths with a large gap between best-case and worst-case timing. In practice, data slack, the unutilized portion of the clock period due to inactive critical paths in a circuit, can often be as high as half of the clock period. In this paper we propose ReDSOC, which dynamically identifies data slack and aggressively recycles it, to improve performance on Out-Of-Order (OOO) cores. It is implemented via a transparent-flow-based data bypass network between the execution units of the core. Further, ReDSOC performs slack-aware OOO instruction scheduling, aided by optimizations to the wakeup and select logic, to support this aggressive operation execution mechanism. ReDSOC is implemented atop OOO cores of different sizes and tested on a variety of general purpose and machine learning applications. The implementation achieves average speedups in the range of 5% to 25% across the different cores and application categories. Further, it is shown to be more efficient at improving performance in comparison to prior proposals. Keywords-clock cycle slack; out-of-order; scheduler; transparent dataflow;
{"title":"Recycling Data Slack in Out-of-Order Cores","authors":"Gokul Subramanian Ravi, Mikko H. Lipasti","doi":"10.1109/HPCA.2019.00065","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00065","url":null,"abstract":"In order to operate reliably and produce expected outputs, modern processors set timing margins conservatively at design time to support extreme variations in workload and environment, imposing a high cost in performance and energy efficiency. The relentless pressure to improve execution bandwidth has exacerbated this problem, requiring instructions with increasingly diverse semantics, leading to datapaths with a large gap between best-case and worst-case timing. In practice, data slack, the unutilized portion of the clock period due to inactive critical paths in a circuit, can often be as high as half of the clock period. In this paper we propose ReDSOC, which dynamically identifies data slack and aggressively recycles it, to improve performance on Out-Of-Order (OOO) cores. It is implemented via a transparent-flow based data bypass network between the execution units of the core. Further, ReDSOC performs slackaware OOO instruction scheduling aided by optimizations to the wakeup and select logic, to support this aggressive operation execution mechanism. ReDSOC is implemented atop OOO cores of different sizes and tested on a variety of general purpose and machine learning applications. The implementation achieves average speedups in the range of 5% to 25% across the different cores and application categories. Further, it is shown to be more efficient at improving performance in comparison to prior proposals. Keywords-clock cycle slack; out-of-order; scheduler; transparent dataflow;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126786807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6%, over the baseline greedy-then-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-of-the-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency. Keywords-warp scheduling; caches; machine learning
{"title":"Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs Using Machine Learning","authors":"Saumay Dublish, V. Nagarajan, N. Topham","doi":"10.1109/HPCA.2019.00061","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00061","url":null,"abstract":"GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6%, over the baseline greedythen-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-ofthe-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency. Keywords-warp scheduling; caches; machine learning","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131347833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph processing is an important analysis technique for a wide range of big data applications. The ability to explicitly represent relationships between entities gives graph analytics a significant performance advantage over traditional relational databases. However, at the microarchitecture level, performance is bounded by inefficiencies in the memory subsystem for single-machine in-memory graph analytics. This paper makes two contributions toward analyzing and optimizing the memory hierarchy for graph processing workloads. First, we perform an in-depth data-type-aware characterization of graph processing workloads on a simulated multi-core architecture. We analyze 1) the memory-level parallelism in an out-of-order core and 2) the request reuse distance in the cache hierarchy. We find that load-load dependency chains involving different application data types form the primary bottleneck to achieving high memory-level parallelism. We also observe that different graph data types exhibit heterogeneous reuse distances. As a result, the private L2 cache contributes negligibly to performance, whereas the shared L3 cache shows higher performance sensitivity. Second, based on our profiling observations, we propose DROPLET, a Data-awaRe decOuPLed prEfeTcher for graph applications. DROPLET prefetches different graph data types differently according to their inherent reuse distances. In addition, DROPLET is physically decoupled to overcome the serialization due to the dependency chains between different data types. DROPLET achieves 19%-102% performance improvement over a no-prefetch baseline, 9%-74% performance improvement over a conventional stream prefetcher, 14%-74% performance improvement
{"title":"Analysis and Optimization of the Memory Hierarchy for Graph Processing Workloads","authors":"Abanti Basak, Shuangchen Li, Xing Hu, Sangmin Oh, Xinfeng Xie, Li Zhao, Xiaowei Jiang, Yuan Xie","doi":"10.1109/HPCA.2019.00051","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00051","url":null,"abstract":"—Graph processing is an important analysis tech- nique for a wide range of big data applications. The ability to explicitly represent relationships between entities gives graph analytics a significant performance advantage over traditional relational databases. However, at the microarchitecture level, performance is bounded by the inefficiencies in the memory subsystem for single-machine in-memory graph analytics. This paper consists of two contributions in which we analyze and optimize the memory hierarchy for graph processing workloads. First,we perform an in-depth data-type-aware characteriza- tion of graph processing workloads on a simulated multi-core architecture. We analyze 1) the memory-level parallelism in an out-of-order core and 2) the request reuse distance in the cache hierarchy. We find that the load-load dependency chains involving different application data types form the primary bottleneck in achieving a high memory-level parallelism. We also observe that different graph data types exhibit heterogeneous reuse distances. As a result, the private L2 cache has negligible contribution to performance, whereas the shared L3 cache shows higher performance sensitivity. Second, based on our profiling observations, we propose DROPLET, a Data-awaRe decOuPLed prEfeTcher for graph applications. DROPLET prefetches different graph data types differently according to their inherent reuse distances. In addition, DROPLET is physically decoupled to overcome the serialization due to the dependency chains between different data types. DROPLET achieves 19%-102% performance im- provement over a no-prefetch baseline, 9%-74% performance improvement over a conventional stream prefetcher, 14%-74% performance improvement","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130871131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Roy, Furkan Turan, K. Järvinen, F. Vercauteren, I. Verbauwhede
Homomorphic encryption is a tool that enables computation on encrypted data and thus has applications in privacy-preserving cloud computing. Though conceptually amazing, implementing homomorphic encryption is very challenging, and software implementations on general-purpose computers are typically extremely slow. In this paper we present our year-long effort to design a domain-specific architecture on a heterogeneous Arm+FPGA platform to accelerate homomorphic computing on encrypted data. We design a custom co-processor for the computationally expensive operations of the well-known Fan-Vercauteren (FV) homomorphic encryption scheme on the FPGA, and make the Arm processor a server for executing different homomorphic applications in the cloud using this FPGA-based co-processor. We use the most recent arithmetic and algorithmic optimization techniques and perform design-space exploration at different levels of the implementation hierarchy. In particular, we apply circuit-level and block-level pipeline strategies to boost the clock frequency and increase the throughput, respectively. To reduce computation latency, we use parallel processing at all levels. Starting from highly optimized building blocks, we gradually build our multi-core, multi-processor architecture for homomorphic computing. We implemented and tested our optimized domain-specific programmable architecture on a single Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. At a 200 MHz FPGA clock, our implementation achieves over 13x speedup with respect to a highly optimized software implementation of the FV homomorphic encryption scheme on an Intel i5 processor running at 1.8 GHz.
{"title":"FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data","authors":"S. Roy, Furkan Turan, K. Järvinen, F. Vercauteren, I. Verbauwhede","doi":"10.1109/HPCA.2019.00052","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00052","url":null,"abstract":"—Homomorphic encryption is a tool that enables computation on encrypted data and thus has applications in privacy-preserving cloud computing. Though conceptually amaz- ing, implementation of homomorphic encryption is very challeng-ing and typically software implementations on general purpose computers are extremely slow. In this paper we present our year long effort to design a domain specific architecture in a heterogeneous Arm+FPGA platform to accelerate homomorphic computing on encrypted data. We design a custom co-processor for the computationally expensive operations of the well-known Fan-Vercauteren (FV) homomorphic encryption scheme on the FPGA, and make the Arm processor a server for executing different homomorphic applications in the cloud, using this FPGA-based co-processor. We use the most recent arithmetic and algorithmic optimization techniques and perform design- space exploration on different levels of the implementation hierarchy. In particular we apply circuit-level and block-level pipeline strategies to boost the clock frequency and increase the throughput respectively. To reduce computation latency, we use parallel processing at all levels. Starting from the highly optimized building blocks, we gradually build our multi-core multi-processor architecture for computing. We implemented and tested our optimized domain specific programmable architecture on a single Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit. At 200 MHz FPGA-clock, our implementation achieves over 13x speedup with respect to a highly optimized software implementation of the FV homomorphic encryption scheme on an Intel i5 processor running at 1.8 GHz.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131890581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, smartphones account for a significant share of the mobile consumer market. Most modern smartphones are built around advanced SoC architectures made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects, with the goal of running a dozen or more critical software use cases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early-stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported. However, early-stage SoC design must often select accelerators before a single line of software is written. To help frame SoC thinking and guide early-stage mobile SoC design, in this paper we contribute the Gables model, which refines and retargets the Roofline model—designed originally for the performance and bandwidth limits of a multicore chip—to model each accelerator on a SoC, to apportion work concurrently among different accelerators (justified by our use-case analysis), and to calculate a SoC performance upper bound. We evaluate the Gables model with an existing SoC and develop several extensions that allow Gables to inform early-stage mobile SoC design.
{"title":"Gables: A Roofline Model for Mobile SoCs","authors":"M. Hill, V. Reddi","doi":"10.1109/HPCA.2019.00047","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00047","url":null,"abstract":"Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, the mobile consumer market undoubtedly involving smartphones has a significant market share. Most modern smartphones comprise of advanced SoC architectures that are made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects with the goal of running a dozen or more critical software usecases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported. However, early-stage SoC design must often select accelerators before a single line of software is written. To help frame SoC thinking and guide early stage mobile SoC design, in this paper we contribute the Gables model that refines and retargets the Roofline model—designed originally for the performance and bandwidth limits of a multicore chip—to model each accelerator on a SoC, to apportion work concurrently among different accelerators (justified by our usecase analysis), and calculate a SoC performance upper bound. We evaluate the Gables model with an existing SoC and develop several extensions that allow Gables to inform early stage mobile SoC design.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133047653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Ganesan, Joshua San Miguel, Natalie D. Enright Jerger
Energy-harvesting devices operate under extremely tight energy constraints. Ensuring forward progress under frequent power outages is paramount. Applications running on these devices are typically amenable to approximation, offering new opportunities to provide better forward progress between power outages. We propose What’s Next (WN), a set of anytime approximation techniques for energy harvesting: subword pipelining, subword vectorization and skim points. Skim points fundamentally decouple the checkpoint location from the recovery location upon a power outage. Ultimately, WN transforms processing on energy-harvesting devices from all-or-nothing to as-is computing. We enable an approximate (yet acceptable) result sooner and proceed to the next task when power is restored rather than resume processing from a checkpoint to yield the perfect output. WN yields speedups of 2.26x and 3.02x on non-volatile and checkpoint-based volatile processors, while still producing high-quality outputs. Keywords-energy harvesting; intermittent computing; approximate computing;
{"title":"The What's Next Intermittent Computing Architecture","authors":"K. Ganesan, Joshua San Miguel, Natalie D. Enright Jerger","doi":"10.1109/HPCA.2019.00039","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00039","url":null,"abstract":"Energy-harvesting devices operate under extremely tight energy constraints. Ensuring forward progress under frequent power outages is paramount. Applications running on these devices are typically amenable to approximation, offering new opportunities to provide better forward progress between power outages. We propose What’s Next (WN), a set of anytime approximation techniques for energy harvesting: subword pipelining, subword vectorization and skim points. Skim points fundamentally decouple the checkpoint location from the recovery location upon a power outage. Ultimately, WN transforms processing on energy-harvesting devices from all-or-nothing to as-is computing. We enable an approximate (yet acceptable) result sooner and proceed to the next task when power is restored rather than resume processing from a checkpoint to yield the perfect output. WN yields speedups of 2.26x and 3.02x on non-volatile and checkpoint-based volatile processors, while still producing high-quality outputs. Keywords-energy harvesting; intermittent computing; approximate computing;","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130922302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chenhao Xie, Xingyao Zhang, Ang Li, Xin Fu, S. Song
With the revolutionary innovations emerging in the computer graphics domain, virtual reality (VR) has become increasingly popular and shown great potential for entertainment, medical simulation and education. In the highly interactive VR world, the motion-to-photon delay (MPD), the delay from a user's head motion to the corresponding image being displayed on the head-mounted device, is the most critical factor for a successful VR experience. Long MPD may cause users to experience significant motion anomalies: judder, lagging and sickness. In order to achieve a short MPD and alleviate the motion anomalies, asynchronous time warp (ATW), an image re-projection technique, has been proposed by VR vendors to map the previously rendered frame to the correct position using the latest head-motion information. However, after a careful investigation of the efficiency of current GPU-accelerated ATW through executing real VR applications on modern VR hardware, we observe that the state-of-the-art ATW technique cannot deliver the ideal MPD and often misses the refresh deadline, resulting in reduced frame rate and motion anomalies. This is caused by two major challenges: an inefficient VR execution model and intensive off-chip memory accesses. To tackle these, we propose a preemption-free Processing-In-Memory based ATW design which asynchronously executes ATW within a 3D-stacked memory, without interrupting the rendering tasks on the host GPU. We also identify a redundancy reduction mechanism to further simplify and accelerate the ATW operation. A comprehensive evaluation of our proposed design demonstrates that our PIM-based ATW can achieve the ideal MPD and provide superior user experience. Finally, we provide a design space exploration to showcase different design choices for the PIM-based ATW design, and the results show that our design scales well in future VR scenarios with higher frame resolution and even lower ideal MPD.
{"title":"PIM-VR: Erasing Motion Anomalies In Highly-Interactive Virtual Reality World with Customized Memory Cube","authors":"Chenhao Xie, Xingyao Zhang, Ang Li, Xin Fu, S. Song","doi":"10.1109/HPCA.2019.00013","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00013","url":null,"abstract":"With the revolutionary innovations emerging in the computer graphics domain, virtual reality (VR) has become increasingly popular and shown great potential for entertainment, medical simulation and education. In the highly interactive VR world, the motion-to-photon delay (MPD) which represents the delay from users’ head motion to the responded image displayed on their head devices, is the most critical factor for a successful VR experience. Long MPD may cause users to experience significant motion anomalies: judder, lagging and sickness. In order to achieve the short MPD and alleviate the motion anomalies, asynchronous time warp (ATW) which is known as an image re-projection technique, has been proposed by VR vendors to map the previously rendered frame to the correct position using the latest headmotion information. However, after a careful investigation on the efficiency of the current GPU-accelerated ATW through executing real VR applications on modern VR hardware, we observe that the state-of-the-art ATW technique cannot deliver the ideal MPD and often misses the refresh deadline, resulting in reduced frame rate and motion anomalies. This is caused by two major challenges: inefficient VR execution model and intensive off-chip memory accesses. To tackle these, we propose a preemption-free Processing-In-Memory based ATW design which asynchronously executes ATW within a 3D-stacked memory, without interrupting the rendering tasks on the host GPU. We also identify a redundancy reduction mechanism to further simplify and accelerate the ATW operation. A comprehensive evaluation of our proposed design demonstrates that our PIM-based ATW can achieve the ideal MPD and provide superior user experience. Finally, we provide a design space exploration to showcase different design choices for the PIM-based ATW design, and the results show that our design scales well in future VR scenarios with higher frame resolution and even lower ideal MPD.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127844591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, Antonio González
GPUs' main workload is real-time image rendering. These applications take a description of an (animated) scene and produce the corresponding image(s). An image is rendered by computing the colors of all its pixels, and it is common for multiple objects to overlap at each pixel. Consequently, a significant amount of processing is devoted to objects that will not be visible in the final image, in spite of the widespread use of the Early Depth Test in modern GPUs, which attempts to discard computations related to occluded objects. Since animations are created by a sequence of similar images, visibility usually does not change much across consecutive frames. Based on this observation, we present Early Visibility Resolution (EVR), a mechanism that leverages the visibility information obtained in one frame to predict the visibility in the following one. Our proposal speculatively determines visibility much earlier in the pipeline than the Early Depth Test. We leverage this early visibility estimation to remove ineffectual computations at two different granularities: pixel-level and tile-level. Results show that these optimizations lead to a 39% performance improvement and 43% energy savings for a set of commercial Android graphics applications running on state-of-the-art mobile GPUs.
{"title":"Early Visibility Resolution for Removing Ineffectual Computations in the Graphics Pipeline","authors":"Martí Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Aragón, Antonio González","doi":"10.1109/HPCA.2019.00015","DOIUrl":"https://doi.org/10.1109/HPCA.2019.00015","url":null,"abstract":"GPUs' main workload is real-time image rendering. These applications take a description of a (animated) scene and produce the corresponding image(s). An image is rendered by computing the colors of all its pixels. It is normal that multiple objects overlap at each pixel. Consequently, a significant amount of processing is devoted to objects that will not be visible in the final image, in spite of the widespread use of the Early Depth Test in modern GPUs, which attempts to discard computations related to occluded objects. Since animations are created by a sequence of similar images, visibility usually does not change much across consecutive frames. Based on this observation, we present Early Visibility Resolution (EVR), a mechanism that leverages the visibility information obtained in a frame to predict the visibility in the following one. Our proposal speculatively determines visibility much earlier in the pipeline than the Early Depth Test. We leverage this early visibility estimation to remove ineffectual computations at two different granularities: pixel-level and tile-level. Results show that such optimizations lead to 39% performance improvement and 43% energy savings for a set of commercial Android graphics applications running on stateof-the-art mobile GPUs.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134050149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}