Learning your limit: managing massively multithreaded caches through scheduling

Commun. ACM · Pub Date: 2014-11-26 · DOI: 10.1145/2682583 · Pages: 91-98
Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt
{"title":"了解您的限制:通过调度管理大规模多线程缓存","authors":"Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt","doi":"10.1145/2682583","DOIUrl":null,"url":null,"abstract":"The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.","PeriodicalId":10645,"journal":{"name":"Commun. ACM","volume":"14 1","pages":"91-98"},"PeriodicalIF":0.0000,"publicationDate":"2014-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Learning your limit: managing massively multithreaded caches through scheduling\",\"authors\":\"Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt\",\"doi\":\"10.1145/2682583\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. 
We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.\",\"PeriodicalId\":10645,\"journal\":{\"name\":\"Commun. ACM\",\"volume\":\"14 1\",\"pages\":\"91-98\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Commun. ACM\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2682583\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Commun. ACM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2682583","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

The gap between processor and memory performance has become a focal point for microprocessor research and development over the past three decades. Modern architectures use two orthogonal approaches to help alleviate this issue: (1) Almost every microprocessor includes some form of on-chip storage, usually in the form of caches, to decrease memory latency and make more effective use of limited memory bandwidth. (2) Massively multithreaded architectures, such as graphics processing units (GPUs), attempt to hide the high latency to memory by rapidly switching between many threads directly in hardware. This paper explores the intersection of these two techniques. We study the effect of accelerating highly parallel workloads with significant locality on a massively multithreaded GPU. We observe that the memory access stream seen by on-chip caches is the direct result of decisions made by the hardware thread scheduler. Our work proposes a hardware scheduling technique that reacts to feedback from the memory system to create a more cache-friendly access stream. We evaluate our technique using simulations and show a significant performance improvement over previously proposed scheduling mechanisms. We demonstrate the effectiveness of scheduling as a cache management technique by comparing cache hit rate using our scheduler and an LRU replacement policy against other scheduling techniques using an optimal cache replacement policy.
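The abstract describes a scheduler that uses feedback from the memory system to limit how many threads are actively scheduled, so that their combined working set stays cache-resident. The C++ sketch below is a minimal software model of that general idea only; it is not the paper's hardware mechanism, and the feedback signal, the thresholds, and the class and field names are assumptions introduced for illustration.

// Minimal software model of feedback-driven warp throttling (illustrative only).
// Assumed feedback: the cache reports, per warp, how many recent misses were to
// lines the warp had touched before (locality lost to eviction). When locality
// is being lost, the scheduler shrinks the set of warps allowed to issue memory
// instructions; when the cache is quiet, it restores parallelism.

#include <cstddef>
#include <vector>

struct WarpFeedback {
    std::size_t lost_locality_misses; // misses to lines this warp evicted earlier
    std::size_t total_accesses;       // all cache accesses by this warp
};

class ThrottlingScheduler {
public:
    explicit ThrottlingScheduler(std::size_t total_warps)
        : active_limit_(total_warps), total_warps_(total_warps) {}

    // Called periodically with per-warp feedback from the memory system.
    void update(const std::vector<WarpFeedback>& feedback) {
        std::size_t lost = 0, accesses = 0;
        for (const WarpFeedback& f : feedback) {
            lost += f.lost_locality_misses;
            accesses += f.total_accesses;
        }
        if (accesses == 0) return;
        const double lost_ratio =
            static_cast<double>(lost) / static_cast<double>(accesses);
        // Hypothetical thresholds: throttle while locality is being lost,
        // relax the limit again once evictions of reused data subside.
        if (lost_ratio > 0.10 && active_limit_ > 1) {
            --active_limit_;                    // fewer warps share the cache
        } else if (lost_ratio < 0.02 && active_limit_ < total_warps_) {
            ++active_limit_;                    // restore multithreading
        }
    }

    // Only the oldest `active_limit_` warps may issue memory instructions;
    // younger warps are held back until the limit grows again.
    bool may_issue(std::size_t warp_age_rank) const {
        return warp_age_rank < active_limit_;
    }

private:
    std::size_t active_limit_;
    std::size_t total_warps_;
};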
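The evaluation compares the cache hit rate achieved by the proposed scheduler with plain LRU replacement against other schedulers paired with an optimal replacement policy. As a rough illustration of how such hit rates can be measured offline, the sketch below replays a single access trace through a fully associative cache under LRU and under Belady's optimal (OPT) replacement; the trace, cache size, and function names are hypothetical and unrelated to the paper's benchmarks.

// Hit-rate comparison of LRU vs. Belady's optimal replacement on one trace.
// Fully associative cache, unit-sized "lines" identified by integers.

#include <cstddef>
#include <iostream>
#include <list>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hit rate of a fully associative LRU cache with `capacity` lines.
double lru_hit_rate(const std::vector<int>& trace, std::size_t capacity) {
    std::list<int> order;                               // MRU at the front
    std::unordered_map<int, std::list<int>::iterator> where;
    std::size_t hits = 0;
    for (int line : trace) {
        auto it = where.find(line);
        if (it != where.end()) {                        // hit: move to MRU
            ++hits;
            order.erase(it->second);
        } else if (order.size() == capacity) {          // miss: evict LRU line
            where.erase(order.back());
            order.pop_back();
        }
        order.push_front(line);
        where[line] = order.begin();
    }
    return trace.empty() ? 0.0 : static_cast<double>(hits) / trace.size();
}

// Hit rate under Belady's OPT: evict the line reused farthest in the future
// (quadratic scan of the trace; fine for a small illustrative example).
double opt_hit_rate(const std::vector<int>& trace, std::size_t capacity) {
    std::unordered_set<int> cache;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const int line = trace[i];
        if (cache.count(line)) { ++hits; continue; }
        if (cache.size() == capacity) {
            int victim = -1;
            std::size_t farthest = 0;
            for (int resident : cache) {                // find next use of each line
                std::size_t next = trace.size();
                for (std::size_t j = i + 1; j < trace.size(); ++j)
                    if (trace[j] == resident) { next = j; break; }
                if (next >= farthest) { farthest = next; victim = resident; }
            }
            cache.erase(victim);
        }
        cache.insert(line);
    }
    return trace.empty() ? 0.0 : static_cast<double>(hits) / trace.size();
}

int main() {
    const std::vector<int> trace = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    std::cout << "LRU hit rate: " << lru_hit_rate(trace, 3) << "\n"
              << "OPT hit rate: " << opt_hit_rate(trace, 3) << "\n";
}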