A detailed GPU cache model based on reuse distance theory

C. Nugteren, Gert-Jan van den Braak, H. Corporaal, H. Bal
2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)
Published: 2014-06-19
DOI: 10.1109/HPCA.2014.6835955 (https://doi.org/10.1109/HPCA.2014.6835955)
Citations: 115

Abstract

As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: (1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, (2) conditional and non-uniform latencies, (3) cache associativity, (4) miss-status holding registers, and (5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate than the GPGPU-Sim simulator.
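For readers unfamiliar with the sequential baseline that the abstract builds on, the following is a minimal sketch of classic reuse distance analysis. It is not the authors' model, which is written in C++ and adds the five GPU-specific extensions listed above. It assumes a single sequential trace of cache-line identifiers and a fully associative LRU cache; the function names `reuse_distances` and `predicted_misses` are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict
import math


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct cache lines touched
    since the previous access to the same line. A first-time access
    has distance math.inf (a compulsory miss).
    """
    stack = OrderedDict()  # LRU stack: the last key is the most recently used
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            # Every key after `line` in the stack was used more recently.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(math.inf)
            stack[line] = True
    return distances


def predicted_misses(trace, cache_lines):
    """Predict misses for a fully associative LRU cache (an assumption).

    An access misses exactly when its reuse distance is at least the
    number of lines the cache can hold.
    """
    return sum(1 for d in reuse_distances(trace) if d >= cache_lines)
```

For the trace `[0, 1, 2, 0, 3, 1, 0]`, the second access to line 0 has reuse distance 2 because lines 1 and 2 were touched in between, so it hits in a cache of three lines but misses in a cache of two. A single pass over the trace yields the miss count for any cache size, which is what makes the approach attractive compared with simulating each configuration separately.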