DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory

Carlos Villavieja, Vasileios Karakostas, L. Vilanova, Yoav Etsion, Alex Ramírez, A. Mendelson, N. Navarro, A. Cristal, O. Unsal
Published in: 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011-10-10
DOI: 10.1109/PACT.2011.65 (https://doi.org/10.1109/PACT.2011.65)
Citations: 101

Abstract

Translation Lookaside Buffers (TLBs) are ubiquitously used in modern architectures to cache virtual-to-physical mappings and, as they are looked up on every memory access, are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to the forefront. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherency transaction among TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and synchronize TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherency mechanism. First, we show that both TLB shootdown cost and frequency increase with the number of processors, and project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, and thereby eliminates the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
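The contrast the abstract draws between the two approaches can be illustrated with a toy model. The sketch below is not the paper's design: the class `TLBDirectory`, its methods, and the "interrupt every core" baseline are hypothetical simplifications chosen for illustration, and the load/store queue support is not modeled at all. It only shows why tracking the exact set of TLBs that cache a mapping lets an invalidation touch fewer cores than an over-approximating software shootdown.

```python
from collections import defaultdict


class TLBDirectory:
    """Toy model (hypothetical): per-core TLBs plus a shared directory of sharers."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        # One dict per core, mapping virtual page number -> physical frame number.
        self.tlbs = [dict() for _ in range(num_cores)]
        # Directory: virtual page number -> set of core ids whose TLB caches it.
        self.sharers = defaultdict(set)

    def fill(self, core, vpn, pfn):
        """Record a TLB fill on `core` and register the core as a sharer."""
        self.tlbs[core][vpn] = pfn
        self.sharers[vpn].add(core)

    def evict(self, core, vpn):
        """Drop a mapping from one TLB and keep the directory in sync."""
        self.tlbs[core].pop(vpn, None)
        self.sharers[vpn].discard(core)

    def shootdown_broadcast(self, vpn):
        """Simplified software baseline: disturb every core, sharer or not."""
        for tlb in self.tlbs:
            tlb.pop(vpn, None)
        self.sharers.pop(vpn, None)
        return self.num_cores  # number of cores disturbed

    def shootdown_directory(self, vpn):
        """Directory-guided invalidation: touch only the recorded sharers."""
        targets = self.sharers.pop(vpn, set())
        for core in targets:
            self.tlbs[core].pop(vpn, None)
        return len(targets)  # number of cores disturbed


if __name__ == "__main__":
    d = TLBDirectory(num_cores=8)
    d.fill(0, 0x10, 0xA)
    d.fill(3, 0x10, 0xA)
    print("directory-guided:", d.shootdown_directory(0x10), "cores disturbed")
    d.fill(1, 0x20, 0xB)
    print("broadcast:", d.shootdown_broadcast(0x20), "cores disturbed")
```

In this model the gap between the two return values grows with the core count whenever a page is cached by only a few TLBs, which matches the scaling concern the abstract raises. A real OS narrows the target set more carefully than a full broadcast, so the baseline here overstates the software cost.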