A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness

Proceedings of the 40th Annual International Symposium on Computer Architecture · Pub Date: 2013-06-23 · DOI: 10.1145/2485922.2485949

Henry Cook, Miquel Moretó, Sarah Bird, Khanh Dao, D. Patterson, K. Asanović


Citations: 135

Abstract

Computing workloads often contain a mix of interactive, latency-sensitive foreground applications and recurring background computations. To guarantee responsiveness, interactive and batch applications are often run on disjoint sets of resources, but this incurs additional energy, power, and capital costs. In this paper, we evaluate the potential of hardware cache partitioning mechanisms and policies to improve efficiency by allowing background applications to run simultaneously with interactive foreground applications, while avoiding degradation in interactive responsiveness. We evaluate these tradeoffs using commercial x86 multicore hardware that supports cache partitioning, and find that real hardware measurements with full applications provide different observations than past simulation-based evaluations. Co-scheduling applications without last-level cache (LLC) partitioning leads to a 10% energy improvement and an average throughput improvement of 54% compared to running tasks separately, but can result in foreground performance degradation of up to 34%, with an average of 6%. With optimal static LLC partitioning, the average energy improvement increases to 12% and the average throughput improvement to 60%, while the worst-case slowdown is reduced noticeably to 7%, with an average slowdown of only 2%. We also evaluate a practical low-overhead dynamic algorithm to control partition sizes, and are able to realize the potential performance guarantees of the optimal static approach, while increasing background throughput by an additional 19%.
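
To make the idea of static versus dynamic partition sizing concrete, the following is a minimal sketch under stated assumptions. It is not the algorithm evaluated in the paper, whose details the abstract does not give. It assumes a way-partitioned last-level cache in which each application is restricted to a bitmask of ways; the function names `split_ways` and `adjust_fg_ways`, the 2% slowdown target, and the `slack` factor are all hypothetical and chosen only for illustration.

```python
def split_ways(total_ways, fg_ways):
    """Return (foreground_mask, background_mask) as disjoint way bitmasks.

    Hypothetical helper: assumes the cache is partitioned by ways and that
    each side must own at least one way.
    """
    if not 0 < fg_ways < total_ways:
        raise ValueError("each side needs at least one way")
    bg_ways = total_ways - fg_ways
    fg_mask = ((1 << fg_ways) - 1) << bg_ways  # upper ways go to the foreground
    bg_mask = (1 << bg_ways) - 1               # lower ways go to the background
    return fg_mask, bg_mask


def adjust_fg_ways(fg_ways, total_ways, fg_slowdown, target=0.02, slack=0.5):
    """One step of a hypothetical feedback controller (illustration only).

    Grow the foreground partition when its measured slowdown exceeds the
    target; shrink it, handing a way to the background, when the slowdown
    is comfortably below the target; otherwise leave it unchanged.
    """
    if fg_slowdown > target and fg_ways < total_ways - 1:
        return fg_ways + 1
    if fg_slowdown < target * slack and fg_ways > 1:
        return fg_ways - 1
    return fg_ways


if __name__ == "__main__":
    fg, bg = split_ways(12, 8)
    print(f"foreground mask {fg:#05x}, background mask {bg:#05x}")
    print("next foreground ways:", adjust_fg_ways(8, 12, fg_slowdown=0.05))
```

A static policy would call `split_ways` once with a fixed allocation, while a dynamic policy would periodically measure foreground slowdown and call `adjust_fg_ways` before recomputing the masks. How slowdown is measured, and how the masks are applied to hardware, is left out because the abstract does not specify either.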

