Nexus:分布式共享缓存中复制的新方法

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2017-09-01 DOI:10.1109/PACT.2017.42

Po-An Tsai, Nathan Beckmann, Daniel Sánchez

{"title":"Nexus:分布式共享缓存中复制的新方法","authors":"Po-An Tsai, Nathan Beckmann, Daniel Sánchez","doi":"10.1109/PACT.2017.42","DOIUrl":null,"url":null,"abstract":"Last-level caches are increasingly distributed, consisting of many small banks. To perform well, most accesses must be served by banks near requesting cores. An attractive approach is to replicate read-only data so that a copy is available nearby. But replication introduces a delicate tradeoff between capacity and latency: too little replication forces cores to access faraway banks, while too much replication wastes cache space and causes excessive off-chip misses. Workloads vary widely in their desired amount of replication, demanding an adaptive approach. Prior adaptive replication techniques only replicate data in each tile's local bank, so they focus on selecting which data to replicate. Unfortunately, data that is not replicated still incurs a full network traversal, limiting the performance of these techniques.We argue that a better strategy is to let cores share replicas and that adaptive schemes should focus on selecting how much to replicate (i.e., how many replicas to have across the chip). This idea fully exploits the latency-capacity tradeoff, achieving qualitatively higher performance than prior adaptive replication techniques. It can be applied to many prior cache organizations, and we demonstrate it on two: Nexus-R extends R-NUCA, and Nexus-J extends Jigsaw. We evaluate Nexus on HPC and server workloads running on a 144-core chip, where it outperforms prior adaptive replication schemes and improves performance by up to 90% and by 23% on average across all workloads sensitive to replication.","PeriodicalId":438103,"journal":{"name":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Nexus: A New Approach to Replication in Distributed Shared Caches\",\"authors\":\"Po-An Tsai, Nathan Beckmann, Daniel Sánchez\",\"doi\":\"10.1109/PACT.2017.42\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Last-level caches are increasingly distributed, consisting of many small banks. To perform well, most accesses must be served by banks near requesting cores. An attractive approach is to replicate read-only data so that a copy is available nearby. But replication introduces a delicate tradeoff between capacity and latency: too little replication forces cores to access faraway banks, while too much replication wastes cache space and causes excessive off-chip misses. Workloads vary widely in their desired amount of replication, demanding an adaptive approach. Prior adaptive replication techniques only replicate data in each tile's local bank, so they focus on selecting which data to replicate. Unfortunately, data that is not replicated still incurs a full network traversal, limiting the performance of these techniques.We argue that a better strategy is to let cores share replicas and that adaptive schemes should focus on selecting how much to replicate (i.e., how many replicas to have across the chip). This idea fully exploits the latency-capacity tradeoff, achieving qualitatively higher performance than prior adaptive replication techniques. It can be applied to many prior cache organizations, and we demonstrate it on two: Nexus-R extends R-NUCA, and Nexus-J extends Jigsaw. We evaluate Nexus on HPC and server workloads running on a 144-core chip, where it outperforms prior adaptive replication schemes and improves performance by up to 90% and by 23% on average across all workloads sensitive to replication.\",\"PeriodicalId\":438103,\"journal\":{\"name\":\"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"volume\":\"97 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2017.42\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2017.42","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

最后一级缓存越来越分散，由许多小银行组成。为了表现良好，大多数访问必须由靠近请求核心的银行提供服务。一种有吸引力的方法是复制只读数据，以便在附近有副本可用。但是复制在容量和延迟之间引入了一个微妙的权衡:太少的复制迫使核心访问遥远的银行，而太多的复制浪费缓存空间并导致过多的芯片外丢失。工作负载所需的复制量差异很大，因此需要一种自适应方法。以前的自适应复制技术只复制每个块的本地库中的数据，因此它们关注于选择要复制的数据。不幸的是，未复制的数据仍然需要遍历整个网络，从而限制了这些技术的性能。我们认为，更好的策略是让核心共享副本，而自适应方案应侧重于选择复制的数量(即，在芯片上有多少副本)。这个想法充分利用了延迟与容量之间的权衡，实现了比以前的自适应复制技术在质量上更高的性能。它可以应用于许多先前的缓存组织，我们在两个例子中进行了演示:Nexus-R扩展了R-NUCA, Nexus-J扩展了Jigsaw。我们对运行在144核芯片上的HPC和服务器工作负载上的Nexus进行了评估，它优于先前的自适应复制方案，在所有对复制敏感的工作负载上，性能提高了高达90%，平均提高了23%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Nexus: A New Approach to Replication in Distributed Shared Caches

Last-level caches are increasingly distributed, consisting of many small banks. To perform well, most accesses must be served by banks near requesting cores. An attractive approach is to replicate read-only data so that a copy is available nearby. But replication introduces a delicate tradeoff between capacity and latency: too little replication forces cores to access faraway banks, while too much replication wastes cache space and causes excessive off-chip misses. Workloads vary widely in their desired amount of replication, demanding an adaptive approach. Prior adaptive replication techniques only replicate data in each tile's local bank, so they focus on selecting which data to replicate. Unfortunately, data that is not replicated still incurs a full network traversal, limiting the performance of these techniques.We argue that a better strategy is to let cores share replicas and that adaptive schemes should focus on selecting how much to replicate (i.e., how many replicas to have across the chip). This idea fully exploits the latency-capacity tradeoff, achieving qualitatively higher performance than prior adaptive replication techniques. It can be applied to many prior cache organizations, and we demonstrate it on two: Nexus-R extends R-NUCA, and Nexus-J extends Jigsaw. We evaluate Nexus on HPC and server workloads running on a 144-core chip, where it outperforms prior adaptive replication schemes and improves performance by up to 90% and by 23% on average across all workloads sensitive to replication.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)

自引率

0.00%

发文量