Limitations Of Cache Prefetching On A Bus-based Multiprocessor

D. Tullsen, S. Eggers
Proceedings of the 20th Annual International Symposium on Computer Architecture, May 1993.
DOI: 10.1145/165123.165163
Citations: 77

Abstract

Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies, and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm running on a bus-based multiprocessor. We showed that, despite a high memory latency, this architecture is not well suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses under several different prefetching strategies and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.