Qunyou Liu, Darong Huang, Luis Costero, Marina Zapater, David Atienza
ACM Transactions on Architecture and Code Optimization. Published 2024-04-20. DOI: 10.1145/3659207 (https://doi.org/10.1145/3659207)
Citations: 0
Abstract
Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads
The increasing demand for computing power and the emergence of heterogeneous computing architectures have driven the exploration of innovative techniques to address current limitations in both the compute and memory subsystems. One such solution is the use of Accelerated Processing Units (APUs), processors that incorporate both a central processing unit (CPU) and an integrated graphics processing unit (iGPU).
However, address translation overhead can significantly degrade the performance of both APU and CPU systems, especially for cache-resident workloads. To address this issue, we propose a new intermediate address space (IAS) for both APU and CPU systems. IAS serves as a bridge between virtual address (VA) spaces and physical address (PA) spaces, optimizing the address translation process. In APU systems, our research indicates that the iGPU suffers from significant translation look-aside buffer (TLB) misses under certain workloads. Using an IAS, we divide address translation into front-end and back-end phases, effectively shifting the address translation bottleneck from the cache side to the memory controller side, a technique that proves effective for cache-resident workloads. Our simulations demonstrate that implementing IAS in a CPU system can boost performance by up to 40% compared to conventional CPU systems. For APU systems, comparing IAS-based systems with traditional systems shows up to a 185% performance improvement with our proposed IAS implementation.
Our analysis also indicates that the cache can filter over 90% of TLB misses, and that a larger cache could yield even greater improvements. The proposed IAS offers a promising and practical way to improve the performance of both APU and CPU systems, contributing to state-of-the-art research in computer architecture.
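The front-end/back-end split described in the abstract can be illustrated with a toy model. The following is a minimal sketch, not the authors' simulator: the names `LRU`, `conventional`, and `with_ias`, the page and line sizes, the fully associative LRU structures, and the identity VA-to-IA mapping are all assumptions made for illustration. It only counts page walks for a workload whose data fits in the cache but whose pages do not fit in the TLB.

```python
from collections import OrderedDict

PAGE = 4096  # bytes per page (assumed)
LINE = 64    # bytes per cache line (assumed)


class LRU:
    """Tiny fully associative LRU set, standing in for both a TLB and a cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert key and evict the oldest entry."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def conventional(addresses, tlb_entries, cache_lines):
    """Baseline: every access is translated through the TLB before the cache lookup."""
    tlb, cache = LRU(tlb_entries), LRU(cache_lines)
    page_walks = 0
    for va in addresses:
        if not tlb.access(va // PAGE):
            page_walks += 1  # TLB miss on the cache side
        cache.access(va // LINE)
    return page_walks


def with_ias(addresses, tlb_entries, cache_lines):
    """IAS-style: cache is indexed by intermediate address; translate only on a cache miss."""
    tlb, cache = LRU(tlb_entries), LRU(cache_lines)
    page_walks = 0
    for va in addresses:
        ia = va  # front end: assumed to be cheap (identity mapping in this toy)
        if cache.access(ia // LINE):
            continue  # cache hit: no back-end translation needed
        if not tlb.access(ia // PAGE):
            page_walks += 1  # back-end miss at the memory-controller side
    return page_walks


# Hypothetical cache-resident workload: one line in each of 64 pages, swept 20 times.
workload = [page * PAGE for _ in range(20) for page in range(64)]

print(conventional(workload, tlb_entries=16, cache_lines=256),
      with_ias(workload, tlb_entries=16, cache_lines=256))  # 1280 64
```

In this toy setup the 64 pages thrash the 16-entry TLB, so the baseline walks the page table on every access, while the IAS-style path translates only on the 64 cold cache misses. The ratio depends entirely on the assumed sizes and access pattern and should not be read as a reproduction of the paper's results.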
About the journal:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.