Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors

S. Rai, Mainak Chaudhuri
{"title":"Exploiting Dynamic Reuse Probability to Manage Shared Last-level Caches in CPU-GPU Heterogeneous Processors","authors":"S. Rai, Mainak Chaudhuri","doi":"10.1145/2925426.2926266","DOIUrl":null,"url":null,"abstract":"Recent commercial chip-multiprocessors (CMPs) have integrated CPU as well as GPU cores on the same die. In today's designs, these cores typically share parts of the memory system resources. However, since the CPU and the GPU cores have vastly different resource requirements, challenging resource partitioning problems arise in such heterogeneous CMPs. In one class of designs, the CPU and the GPU cores share the large on-die last-level SRAM cache. In this paper, we explore mechanisms to dynamically allocate the shared last-level cache space to the CPU and GPU applications in such designs. A CPU core executes an instruction progressively in a pipeline generating memory accesses (for instruction and data) only in a few pipeline stages. On the other hand, a GPU can access different data streams having different semantic meanings and disparate access patterns throughout the rendering pipeline. Such data streams include input vertex, pixel depth, pixel color, texture map, shader instructions, shader data (including shader register spills and fills), etc.. Without carefully designed last-level cache management policies, the CPU and the GPU data streams can interfere with each other leading to significant loss in CPU and GPU performance accompanied by degradation in GPU-rendered 3D animation quality. Our proposal dynamically estimates the reuse probabilities of the GPU streams as well as the CPU data by sampling portions of the CPU and GPU working sets and storing the sampled tags in a small working set sample cache. Since the GPU application working sets are typically very large, for this working set sample cache to be effective, it is custom-designed to have large coverage while requiring few tens of kilobytes of storage. We use the estimated reuse probabilities to design shared last-level cache policies for handling hits and misses to reads and writes from both types of cores. Studies on a detailed heterogeneous CMP simulator show that compared to a state-of-the-art baseline with a 16 MB shared last-level cache, our proposal can improve the performance (frame rate or execution cycles, as applicable) of eighteen GPU workloads spanning DirectX and OpenGL game titles as well as CUDA applications by 12% on average and up to 51% while improving the performance of the co-running quad-core CPU workload mixes by 7% on average and up to 19%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2016 International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2925426.2926266","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Recent commercial chip-multiprocessors (CMPs) have integrated CPU as well as GPU cores on the same die. In today's designs, these cores typically share parts of the memory system resources. However, since the CPU and the GPU cores have vastly different resource requirements, challenging resource partitioning problems arise in such heterogeneous CMPs. In one class of designs, the CPU and the GPU cores share the large on-die last-level SRAM cache. In this paper, we explore mechanisms to dynamically allocate the shared last-level cache space to the CPU and GPU applications in such designs. A CPU core executes an instruction progressively through a pipeline, generating memory accesses (for instructions and data) in only a few pipeline stages. On the other hand, a GPU accesses different data streams with different semantic meanings and disparate access patterns throughout the rendering pipeline. Such data streams include input vertices, pixel depth, pixel color, texture maps, shader instructions, shader data (including shader register spills and fills), etc. Without carefully designed last-level cache management policies, the CPU and the GPU data streams can interfere with each other, leading to significant losses in CPU and GPU performance accompanied by degradation in GPU-rendered 3D animation quality. Our proposal dynamically estimates the reuse probabilities of the GPU streams as well as the CPU data by sampling portions of the CPU and GPU working sets and storing the sampled tags in a small working set sample cache. Since GPU application working sets are typically very large, this working set sample cache is custom-designed to offer large coverage while requiring only a few tens of kilobytes of storage. We use the estimated reuse probabilities to design shared last-level cache policies for handling read and write hits and misses from both types of cores. Studies on a detailed heterogeneous CMP simulator show that, compared to a state-of-the-art baseline with a 16 MB shared last-level cache, our proposal improves the performance (frame rate or execution cycles, as applicable) of eighteen GPU workloads spanning DirectX and OpenGL game titles as well as CUDA applications by 12% on average and up to 51%, while improving the performance of the co-running quad-core CPU workload mixes by 7% on average and up to 19%.
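To make the central idea concrete, the following is a minimal Python sketch (not from the paper) of how a working set sample cache could estimate per-stream reuse probability: only the tags of accesses falling into a sampled subset of cache sets are tracked, an access whose tag is already present counts as a reuse, and the ratio of reuses to sampled accesses approximates the stream's reuse probability. The class name `WorkingSetSampleCache`, the methods `access` and `reuse_probability`, and the parameter choices are all hypothetical illustrations, not the paper's actual structures.

```python
from collections import OrderedDict

class WorkingSetSampleCache:
    """Sketch of a working set sample cache: it tracks only the tags of a
    sampled subset of cache sets and estimates, per stream, the probability
    that a block is re-accessed while its tag is still being tracked.
    (Hypothetical illustration; parameters and layout are assumptions.)"""

    def __init__(self, capacity_tags=4096, sample_mod=16, num_sets=1024):
        self.capacity = capacity_tags    # tens of KB if each entry is a few bytes
        self.sample_mod = sample_mod     # observe 1 out of every sample_mod sets
        self.num_sets = num_sets         # assumed LLC set count (hypothetical)
        self.tags = OrderedDict()        # sampled tag -> stream id, in LRU order
        self.sampled = {}                # stream -> sampled accesses seen
        self.reused = {}                 # stream -> sampled accesses that hit

    def _is_sampled(self, block_addr):
        # Sampling whole sets (rather than random blocks) lets a small
        # structure cover the very large GPU working sets.
        set_index = (block_addr >> 6) % self.num_sets   # 64-byte blocks assumed
        return set_index % self.sample_mod == 0

    def access(self, block_addr, stream):
        """Record one memory access from `stream` (e.g. 'texture', 'color',
        'depth', 'cpu') and update that stream's reuse statistics."""
        if not self._is_sampled(block_addr):
            return
        self.sampled[stream] = self.sampled.get(stream, 0) + 1
        tag = block_addr >> 6
        if tag in self.tags:
            self.reused[stream] = self.reused.get(stream, 0) + 1
            self.tags.move_to_end(tag)           # refresh LRU position
        else:
            if len(self.tags) >= self.capacity:
                self.tags.popitem(last=False)    # evict the LRU sampled tag
            self.tags[tag] = stream

    def reuse_probability(self, stream):
        """Estimated probability that an access from this stream re-touches
        a recently observed block; can drive LLC insertion/bypass policy."""
        n = self.sampled.get(stream, 0)
        return self.reused.get(stream, 0) / n if n else 0.0
```

A last-level cache controller could consult these estimates on each fill, for instance inserting blocks of a stream whose reuse probability falls below a threshold at low replacement priority (or bypassing them altogether) while protecting high-reuse streams. The paper's actual policies are more refined, handling read and write hits and misses from the CPU and GPU cores separately.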