Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors

S. Rai, Mainak Chaudhuri
{"title":"Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors","authors":"S. Rai, Mainak Chaudhuri","doi":"10.1145/2925426.2926266","DOIUrl":null,"url":null,"abstract":"Recent commercial chip-multiprocessors (CMPs) have integrated CPU as well as GPU cores on the same die. In today's designs, these cores typically share parts of the memory system resources. However, since the CPU and the GPU cores have vastly different resource requirements, challenging resource partitioning problems arise in such heterogeneous CMPs. In one class of designs, the CPU and the GPU cores share the large on-die last-level SRAM cache. In this paper, we explore mechanisms to dynamically allocate the shared last-level cache space to the CPU and GPU applications in such designs. A CPU core executes an instruction progressively in a pipeline generating memory accesses (for instruction and data) only in a few pipeline stages. On the other hand, a GPU can access different data streams having different semantic meanings and disparate access patterns throughout the rendering pipeline. Such data streams include input vertex, pixel depth, pixel color, texture map, shader instructions, shader data (including shader register spills and fills), etc.. Without carefully designed last-level cache management policies, the CPU and the GPU data streams can interfere with each other leading to significant loss in CPU and GPU performance accompanied by degradation in GPU-rendered 3D animation quality. Our proposal dynamically estimates the reuse probabilities of the GPU streams as well as the CPU data by sampling portions of the CPU and GPU working sets and storing the sampled tags in a small working set sample cache. Since the GPU application working sets are typically very large, for this working set sample cache to be effective, it is custom-designed to have large coverage while requiring few tens of kilobytes of storage. We use the estimated reuse probabilities to design shared last-level cache policies for handling hits and misses to reads and writes from both types of cores. Studies on a detailed heterogeneous CMP simulator show that compared to a state-of-the-art baseline with a 16 MB shared last-level cache, our proposal can improve the performance (frame rate or execution cycles, as applicable) of eighteen GPU workloads spanning DirectX and OpenGL game titles as well as CUDA applications by 12% on average and up to 51% while improving the performance of the co-running quad-core CPU workload mixes by 7% on average and up to 19%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2925426.2926266","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Recent commercial chip-multiprocessors (CMPs) have integrated CPU as well as GPU cores on the same die. In today's designs, these cores typically share parts of the memory system resources. However, since the CPU and the GPU cores have vastly different resource requirements, challenging resource partitioning problems arise in such heterogeneous CMPs. In one class of designs, the CPU and the GPU cores share the large on-die last-level SRAM cache. In this paper, we explore mechanisms to dynamically allocate the shared last-level cache space to the CPU and GPU applications in such designs. A CPU core executes an instruction progressively through a pipeline, generating memory accesses (for instructions and data) in only a few pipeline stages. On the other hand, a GPU accesses different data streams with different semantic meanings and disparate access patterns throughout the rendering pipeline. Such data streams include input vertices, pixel depth, pixel color, texture maps, shader instructions, shader data (including shader register spills and fills), etc. Without carefully designed last-level cache management policies, the CPU and the GPU data streams can interfere with each other, leading to significant losses in CPU and GPU performance accompanied by degradation in GPU-rendered 3D animation quality. Our proposal dynamically estimates the reuse probabilities of the GPU streams as well as the CPU data by sampling portions of the CPU and GPU working sets and storing the sampled tags in a small working set sample cache. Since GPU application working sets are typically very large, this working set sample cache is custom-designed to offer large coverage while requiring only a few tens of kilobytes of storage. We use the estimated reuse probabilities to design shared last-level cache policies for handling read and write hits and misses from both types of cores. Studies on a detailed heterogeneous CMP simulator show that, compared to a state-of-the-art baseline with a 16 MB shared last-level cache, our proposal improves the performance (frame rate or execution cycles, as applicable) of eighteen GPU workloads spanning DirectX and OpenGL game titles as well as CUDA applications by 12% on average and up to 51%, while improving the performance of the co-running quad-core CPU workload mixes by 7% on average and up to 19%.
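To make the central idea concrete, the following is a minimal Python sketch (not from the paper) of how a working set sample cache could estimate per-stream reuse probability: only the tags of accesses falling into a sampled subset of cache sets are tracked, an access whose tag is already present counts as a reuse, and the ratio of reuses to sampled accesses approximates the stream's reuse probability. The class name `WorkingSetSampleCache`, the methods `access` and `reuse_probability`, and the parameter choices are all hypothetical illustrations, not the paper's actual structures.

```python
from collections import OrderedDict

class WorkingSetSampleCache:
    """Sketch of a working set sample cache: it tracks only the tags of a
    sampled subset of cache sets and estimates, per stream, the probability
    that a block is re-accessed while its tag is still being tracked.
    (Hypothetical illustration; parameters and layout are assumptions.)"""

    def __init__(self, capacity_tags=4096, sample_mod=16, num_sets=1024):
        self.capacity = capacity_tags    # tens of KB if each entry is a few bytes
        self.sample_mod = sample_mod     # observe 1 out of every sample_mod sets
        self.num_sets = num_sets         # assumed LLC set count (hypothetical)
        self.tags = OrderedDict()        # sampled tag -> stream id, in LRU order
        self.sampled = {}                # stream -> sampled accesses seen
        self.reused = {}                 # stream -> sampled accesses that hit

    def _is_sampled(self, block_addr):
        # Sampling whole sets (rather than random blocks) lets a small
        # structure cover the very large GPU working sets.
        set_index = (block_addr >> 6) % self.num_sets   # 64-byte blocks assumed
        return set_index % self.sample_mod == 0

    def access(self, block_addr, stream):
        """Record one memory access from `stream` (e.g. 'texture', 'color',
        'depth', 'cpu') and update that stream's reuse statistics."""
        if not self._is_sampled(block_addr):
            return
        self.sampled[stream] = self.sampled.get(stream, 0) + 1
        tag = block_addr >> 6
        if tag in self.tags:
            self.reused[stream] = self.reused.get(stream, 0) + 1
            self.tags.move_to_end(tag)           # refresh LRU position
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)    # evict the LRU sampled tag
            self.tags[tag] = stream

    def reuse_probability(self, stream):
        """Estimated probability that an access from this stream re-touches
        a recently observed block; can drive LLC insertion/bypass policy."""
        n = self.sampled.get(stream, 0)
        return self.reused.get(stream, 0) / n if n else 0.0
```

A last-level cache controller could consult these estimates on each fill, for instance inserting blocks of a stream whose reuse probability falls below a threshold at low replacement priority (or bypassing them altogether) while protecting high-reuse streams. The paper's actual policies are more refined, handling read and write hits and misses from the CPU and GPU cores separately.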