B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors

David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul V. Gratz, Daniel A. Jiménez
{"title":"B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors","authors":"David Kadjo, Jinchun Kim, Prabal Sharma, Reena Panda, Paul V. Gratz, Daniel A. Jiménez","doi":"10.1109/MICRO.2014.29","DOIUrl":null,"url":null,"abstract":"For decades, the primary tools in alleviating the \"Memory Wall\" have been large cache hierarchies and dataprefetchers. Both approaches, become more challenging in modern, Chip-multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in terms of performance per Watt, given VLSI power scaling trends, this approach becomes hard to justify. These trends also impact hardware budgets for prefetchers. Moreover, in the context of CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution effects. These concerns point to the need for a light-weight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead and low accuracy (Next-n, Stride, etc.) or high-overhead and high accuracy (STeMS, ISB). Wepropose B-Fetch: a data prefetcher driven by branch prediction and effective address value speculation. B-Fetch leverages control flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective address of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads over a baseline system without prefetching. We find that B-Fetch outperforms an existing \"best-of-class\" light-weight prefetcher under single-threaded and multi programmed workloads by 9% on average, with 65% less storage overhead.","PeriodicalId":6591,"journal":{"name":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","volume":"66 1","pages":"623-634"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 47th Annual IEEE/ACM International Symposium on Microarchitecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MICRO.2014.29","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 40

Abstract

For decades, the primary tools for alleviating the "Memory Wall" have been large cache hierarchies and data prefetchers. Both approaches become more challenging in modern chip-multiprocessor (CMP) design. Increasing the last-level cache (LLC) size yields diminishing returns in performance per Watt; given VLSI power-scaling trends, this approach becomes hard to justify. These trends also constrain hardware budgets for prefetchers. Moreover, in CMPs running multiple concurrent processes, prefetching accuracy is critical to prevent cache pollution. These concerns point to the need for a lightweight prefetcher with high accuracy. Existing data prefetchers may generally be classified as low-overhead but low-accuracy (Next-n, Stride, etc.) or high-overhead but high-accuracy (STeMS, ISB). We propose B-Fetch: a data prefetcher driven by branch prediction and effective-address value speculation. B-Fetch leverages control-flow prediction to generate an expected future path of the executing application. It then speculatively computes the effective addresses of the load instructions along that path based upon a history of past register transformations. Detailed simulation using a cycle-accurate simulator shows a geometric mean speedup of 23.4% for single-threaded workloads, improving to 28.6% for multi-application workloads, over a baseline system without prefetching. We find that B-Fetch outperforms an existing "best-of-class" lightweight prefetcher under single-threaded and multi-programmed workloads by 9% on average, with 65% less storage overhead.
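To make the mechanism in the abstract concrete, the following is a minimal C++ sketch of the idea: walk the branch-predicted future path, and for each load speculate its effective address from the base register's last known value plus a learned per-register transformation, then issue a prefetch. All structures here (Instr, RegState, the fixed delta) are illustrative assumptions, not the paper's actual hardware tables.

```cpp
// Hypothetical sketch of branch-prediction-directed prefetching.
// Not the authors' implementation; a toy model of the concept only.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct Instr {
    bool is_load;      // true if this instruction is a load
    int base_reg;      // base register used by the load's address
    int64_t offset;    // immediate offset of the load
};

// Stand-in for the paper's "history of past register transformations":
// the register's last known value and a speculated change along the path.
struct RegState {
    int64_t value;     // last architecturally known value
    int64_t delta;     // learned transformation per path step
};

int main() {
    // Assumed future path as supplied by a branch predictor: the blocks
    // (and loads) we expect the application to execute next.
    std::vector<Instr> predicted_path = {
        {false, 1, 0},   // ALU op updating r1; learned transformation +8
        {true,  1, 16},  // load [r1 + 16]
        {true,  1, 24},  // load [r1 + 24]
    };

    std::unordered_map<int, RegState> regs;
    regs[1] = {0x1000, 8};  // r1 = 0x1000, learned delta of +8

    for (const Instr& in : predicted_path) {
        if (!in.is_load) {
            // Speculatively apply the learned register transformation.
            regs[in.base_reg].value += regs[in.base_reg].delta;
            continue;
        }
        // Speculative effective address of a load on the predicted path.
        int64_t ea = regs[in.base_reg].value + in.offset;
        std::cout << "prefetch 0x" << std::hex << ea << std::dec << "\n";
    }
    return 0;
}
```

In this toy model the prefetcher never waits for the load to reach execute: it follows the predicted control flow ahead of the core and issues prefetches for 0x1018 and 0x1020, which is the accuracy-over-coverage trade-off the abstract attributes to B-Fetch.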