B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors

David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul V. Gratz, Daniel A. Jiménez
{"title":"B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors","authors":"David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul V. Gratz, Daniel A. Jiménez","doi":"10.1109/MICRO.2014.29","DOIUrl":null,"url":null,"abstract":"For decades, the primary tools in alleviating the \"Memory Wall\" have been large cache hierarchies and dataprefetchers. Both approaches, become more challenging in modern, Chip-multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in terms of performance per Watt, given VLSI power scaling trends, this approach becomes hard to justify. These trends also impact hardware budgets for prefetchers. Moreover, in the context of CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution effects. These concerns point to the need for a light-weight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead and low accuracy (Next-n, Stride, etc.) or high-overhead and high accuracy (STeMS, ISB). Wepropose B-Fetch: a data prefetcher driven by branch prediction and effective address value speculation. B-Fetch leverages control flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective address of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads over a baseline system without prefetching. We find that B-Fetch outperforms an existing \"best-of-class\" light-weight prefetcher under single-threaded and multi programmed workloads by 9% on average, with 65% less storage overhead.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"66 1","pages":"623-634"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MICRO.2014.29","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 40

Abstract

For decades, the primary tools for alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers. Both approaches become more challenging in modern chip-multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in performance per Watt; given VLSI power-scaling trends, this approach becomes hard to justify. These trends also constrain hardware budgets for prefetchers. Moreover, in CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution. These concerns point to the need for a lightweight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead but low-accuracy (Next-n, Stride, etc.) or high-overhead but high-accuracy (STeMS, ISB). We propose B-Fetch: a data prefetcher driven by branch prediction and effective-address value speculation. B-Fetch leverages control-flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective addresses of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle-accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads, over a baseline system without prefetching. We find that B-Fetch outperforms an existing "best-of-class" lightweight prefetcher under single-threaded and multi-programmed workloads by 9% on average, with 65% less storage overhead.
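To make the mechanism in the abstract concrete, the following is a minimal C++ sketch of the idea: walk the branch-predicted future path, and for each load speculate its effective address from the base register's last known value plus a learned per-register transformation, then issue a prefetch. All structures here (Instr, RegState, the fixed delta) are illustrative assumptions, not the paper's actual hardware tables.

```cpp
// Hypothetical sketch of branch-prediction-directed prefetching.
// Not the authors' implementation; a toy model of the concept only.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Instr {
    bool is_load;      // true if this instruction is a load
    int base_reg;      // base register used by the load's address
    int64_t offset;    // immediate offset of the load
};

// Stand-in for the paper's "history of past register transformations":
// the register's last known value and a speculated change along the path.
struct RegState {
    int64_t value;     // last architecturally known value
    int64_t delta;     // learned transformation per path step
};

int main() {
    // Assumed future path as supplied by a branch predictor: the blocks
    // (and loads) we expect the application to execute next.
    std::vector<Instr> predicted_path = {
        {false, 1, 0},   // ALU op updating r1; learned transformation +8
        {true,  1, 16},  // load [r1 + 16]
        {true,  1, 24},  // load [r1 + 24]
    };

    std::unordered_map<int, RegState> regs;
    regs[1] = {0x1000, 8};  // r1 = 0x1000, learned delta of +8

    for (const Instr& in : predicted_path) {
        if (!in.is_load) {
            // Speculatively apply the learned register transformation.
            regs[in.base_reg].value += regs[in.base_reg].delta;
            continue;
        }
        // Speculative effective address of a load on the predicted path.
        int64_t ea = regs[in.base_reg].value + in.offset;
        std::cout << "prefetch 0x" << std::hex << ea << std::dec << "\n";
    }
    return 0;
}
```

In this toy model the prefetcher never waits for the load to reach execute: it follows the predicted control flow ahead of the core and issues prefetches for 0x1018 and 0x1020, which is the accuracy-over-coverage trade-off the abstract attributes to B-Fetch.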