DiDi: Mitigating the Performance Impact of TLB Shootdowns Using a Shared TLB Directory

2011 International Conference on Parallel Architectures and Compilation Techniques (PACT), published 2011-10-10. DOI: 10.1109/PACT.2011.65

Carlos Villavieja, Vasileios Karakostas, L. Vilanova, Yoav Etsion, Alex Ramírez, A. Mendelson, N. Navarro, A. Cristal, O. Unsal


Citations: 101

Abstract

Translation Lookaside Buffers (TLBs) are ubiquitous in modern architectures, where they cache virtual-to-physical mappings; because they are looked up on every memory access, they are paramount to performance scalability. The emergence of chip multiprocessors (CMPs) with per-core TLBs has brought the problem of TLB coherence to center stage. TLBs are kept coherent at the software level by the operating system (OS). Whenever the OS modifies page permissions in a page table, it must initiate a coherence transaction among the TLBs, a process known as a TLB shootdown. Current CMPs rely on the OS to approximate the set of TLBs caching a mapping and to synchronize those TLBs using costly Inter-Processor Interrupts (IPIs) and software handlers. In this paper, we characterize the impact of TLB shootdowns on multiprocessor performance and scalability, and present the design of a scalable TLB coherence mechanism. First, we show that both the cost and the frequency of TLB shootdowns increase with the number of processors, and we project that software-based TLB shootdowns would thwart the performance of large multiprocessors. We then present a scalable architectural mechanism that couples a shared TLB directory with load/store queue support for lightweight TLB invalidation, thereby eliminating the need for costly IPIs. Finally, we show that the proposed mechanism reduces the fraction of machine cycles wasted on TLB shootdowns by an order of magnitude.
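To make the contrast between OS-approximated shootdown targets and a directory that tracks sharers concrete, here is a minimal Python sketch. It is an illustrative assumption, not the DiDi hardware design: the names `TLBDirectory`, `fill`, `evict`, `invalidate`, and `os_style_targets` are hypothetical, each TLB is modeled as a plain dictionary, and the load/store queue invalidation path, message latency, and directory capacity are not modeled at all. The sketch only shows the core idea that a shared directory recording which cores cache a translation lets an invalidation reach exactly those cores, whereas a coarse OS-style approximation interrupts every core that has run the process.

```python
from collections import defaultdict


class TLBDirectory:
    """Hypothetical shared TLB directory (illustrative sketch only).

    Maps a virtual page number (VPN) to the set of core IDs whose private
    TLB currently holds that translation.
    """

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = defaultdict(set)                 # vpn -> {core ids}
        self.tlbs = [dict() for _ in range(num_cores)]  # per-core TLB: vpn -> pfn

    def fill(self, core, vpn, pfn):
        """A TLB miss on `core` fills its TLB and registers the core as a sharer."""
        self.tlbs[core][vpn] = pfn
        self.sharers[vpn].add(core)

    def evict(self, core, vpn):
        """A capacity eviction removes the entry and deregisters the sharer."""
        self.tlbs[core].pop(vpn, None)
        cores = self.sharers.get(vpn)
        if cores is not None:
            cores.discard(core)
            if not cores:
                del self.sharers[vpn]

    def invalidate(self, vpn):
        """Invalidate `vpn` only in the cores listed as sharers.

        Returns the sorted list of cores that received an invalidation.
        """
        targets = sorted(self.sharers.pop(vpn, set()))
        for core in targets:
            self.tlbs[core].pop(vpn, None)
        return targets


def os_style_targets(cores_that_ran_process, initiator):
    """Coarse OS-style approximation (assumption): interrupt every other core
    that has run the process, whether or not its TLB holds the page."""
    return sorted(c for c in cores_that_ran_process if c != initiator)


if __name__ == "__main__":
    d = TLBDirectory(num_cores=8)
    d.fill(0, 0x42, 0x1000)
    d.fill(3, 0x42, 0x1000)
    d.fill(5, 0x99, 0x2000)
    print(d.invalidate(0x42))                         # [0, 3]
    print(os_style_targets(range(8), initiator=0))    # [1, 2, 3, 4, 5, 6, 7]
```

In this toy scenario the directory sends two invalidations, while the coarse approximation would interrupt seven cores for a process that had run on all eight. The numbers are hypothetical and only illustrate why precise sharer tracking can avoid unnecessary IPIs; they are not results from the paper.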


