Learning your limit: managing massively multithreaded caches through scheduling

Commun. ACM · Pub Date: 2014-11-26 · DOI: 10.1145/2682583 · Pages: 91-98
Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt
{"title":"了解您的限制:通过调度管理大规模多线程缓存","authors":"Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt","doi":"10.1145/2682583","DOIUrl":null,"url":null,"abstract":"The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.","PeriodicalId":10645,"journal":{"name":"Commun. ACM","volume":"14 1","pages":"91-98"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Learning your limit: managing massively multithreaded caches through scheduling\",\"authors\":\"Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt\",\"doi\":\"10.1145/2682583\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. 
We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.\",\"PeriodicalId\":10645,\"journal\":{\"name\":\"Commun. ACM\",\"volume\":\"14 1\",\"pages\":\"91-98\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Commun. ACM\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2682583\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Commun. ACM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2682583","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.
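The abstract describes a scheduler that uses feedback from the memory system to limit how many threads are actively scheduled, so that their combined working set stays cache-resident. The C++ sketch below is a minimal software model of that general idea only; it is not the paper's hardware mechanism, and the feedback signal, the thresholds, and the class and field names are assumptions introduced for illustration.

// Minimal software model of feedback-driven warp throttling (illustrative only).
// Assumed feedback: the cache reports, per warp, how many recent misses were to
// lines the warp had touched before (locality lost to eviction). When locality
// is being lost, the scheduler shrinks the set of warps allowed to issue memory
// instructions; when the cache is quiet, it restores parallelism.

#include <cstddef>
#include <vector>

struct WarpFeedback {
    std::size_t lost_locality_misses; // misses to lines this warp evicted earlier
    std::size_t total_accesses;       // all cache accesses by this warp
};

class ThrottlingScheduler {
public:
    explicit ThrottlingScheduler(std::size_t total_warps)
        : active_limit_(total_warps), total_warps_(total_warps) {}

    // Called periodically with per-warp feedback from the memory system.
    void update(const std::vector<WarpFeedback>& feedback) {
        std::size_t lost = 0, accesses = 0;
        for (const WarpFeedback& f : feedback) {
            lost += f.lost_locality_misses;
            accesses += f.total_accesses;
        }
        if (accesses == 0) return;
        const double lost_ratio =
            static_cast<double>(lost) / static_cast<double>(accesses);
        // Hypothetical thresholds: throttle while locality is being lost,
        // relax the limit again once evictions of reused data subside.
        if (lost_ratio > 0.10 && active_limit_ > 1) {
            --active_limit_;                    // fewer warps share the cache
        } else if (lost_ratio < 0.02 && active_limit_ < total_warps_) {
            ++active_limit_;                    // restore multithreading
        }
    }

    // Only the oldest `active_limit_` warps may issue memory instructions;
    // younger warps are held back until the limit grows again.
    bool may_issue(std::size_t warp_age_rank) const {
        return warp_age_rank < active_limit_;
    }

private:
    std::size_t active_limit_;
    std::size_t total_warps_;
};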
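The evaluation compares the cache hit rate achieved by the proposed scheduler with plain LRU replacement against other schedulers paired with an optimal replacement policy. As a rough illustration of how such hit rates can be measured offline, the sketch below replays a single access trace through a fully associative cache under LRU and under Belady's optimal (OPT) replacement; the trace, cache size, and function names are hypothetical and unrelated to the paper's benchmarks.

// Hit-rate comparison of LRU vs. Belady's optimal replacement on one trace.
// Fully associative cache, unit-sized "lines" identified by integers.

#include <cstddef>
#include <iostream>
#include <list>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hit rate of a fully associative LRU cache with `capacity` lines.
double lru_hit_rate(const std::vector<int>& trace, std::size_t capacity) {
    std::list<int> order;                               // MRU at the front
    std::unordered_map<int, std::list<int>::iterator> where;
    std::size_t hits = 0;
    for (int line : trace) {
        auto it = where.find(line);
        if (it != where.end()) {                        // hit: move to MRU
            ++hits;
            order.erase(it->second);
        } else if (order.size() == capacity) {          // miss: evict LRU line
            where.erase(order.back());
            order.pop_back();
        }
        order.push_front(line);
        where[line] = order.begin();
    }
    return trace.empty() ? 0.0 : static_cast<double>(hits) / trace.size();
}

// Hit rate under Belady's OPT: evict the line reused farthest in the future
// (quadratic scan of the trace; fine for a small illustrative example).
double opt_hit_rate(const std::vector<int>& trace, std::size_t capacity) {
    std::unordered_set<int> cache;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const int line = trace[i];
        if (cache.count(line)) { ++hits; continue; }
        if (cache.size() == capacity) {
            int victim = -1;
            std::size_t farthest = 0;
            for (int resident : cache) {                // find next use of each line
                std::size_t next = trace.size();
                for (std::size_t j = i + 1; j < trace.size(); ++j)
                    if (trace[j] == resident) { next = j; break; }
                if (next >= farthest) { farthest = next; victim = resident; }
            }
            cache.erase(victim);
        }
        cache.insert(line);
    }
    return trace.empty() ? 0.0 : static_cast<double>(hits) / trace.size();
}

int main() {
    const std::vector<int> trace = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    std::cout << "LRU hit rate: " << lru_hit_rate(trace, 3) << "\n"
              << "OPT hit rate: " << opt_hit_rate(trace, 3) << "\n";
}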