X-Cache: A Modular Architecture for Domain-Specific Caches

Proceedings of the 49th Annual International Symposium on Computer Architecture
Pub Date: 2022-06-11
DOI: 10.1145/3470496.3527380

A. Sedaghati, Milad Hakimi, Reza Hojabr, Arrvindh Shriraman


Citations: 1

Abstract


X-cache: a modular architecture for domain-specific caches

With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). State-of-the-art DSAs work with sparse data [37] and indirectly-indexed data structures [18, 30]. They introduce non-affine and dynamic memory accesses [7, 35], and require domain-specific caches. Unfortunately, cache controllers are notorious for being difficult to architect; domain-specialization compounds the problem. DSA caches need to support custom tags, data-structure walks, multiple refills, and preloading. Prior DSAs include ad-hoc cache structures, and do not implement the cache controller. We propose X-Cache, a reusable caching idiom for DSAs. We will be open-sourcing a toolchain for both generating the RTL and programming X-Cache. There are three key ideas:

i) DSA-specific Tags (Meta-tag): The designer can use any combination of fields from the DSA-metadata as the tag. Meta-tags eliminate the overhead of walking and translating metadata to global addresses. This saves energy, and improves load-to-use latency.

ii) DSA-programmable walkers (X-Actions): We find that a common set of microcode actions can be used to implement the DSA-specific walking, data block, and tag management. We develop a programmable microcode engine that can efficiently realize the data orchestration.

iii) DSA-portable controller (X-Routines): We use a portable abstraction, coroutines, to let the designer express walking and orchestration. Coroutines capture the block-level parallelism, remain lightweight, and minimize controller occupancy.

We create caches for four different DSA families: Sparse GEMM [35, 37], GraphPulse [30], DASX [22], and Widx [18]. X-Cache outperforms address-based caches by 1.7× and remains competitive with hardwired DSAs (even achieving a 50% improvement in one case). We demonstrate that meta-tags save 26-79% energy compared to address-tags. In X-Cache, meta-tags consume 1.5-6.5% of data RAM energy and the programmable microcode adds a further 7%.
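The meta-tag and coroutine ideas can be illustrated with a small software model. The sketch below is not the paper's RTL or toolchain; it is a minimal Python analogy under stated assumptions: the sparse matrix is stored in CSR form, the meta-tag is the pair (row, col), and the walker is a generator that yields once per simulated memory access. All names (`CSR`, `walk`, `MetaTagCache`, `mem_accesses`) and the FIFO eviction policy are hypothetical choices made for illustration.

```python
from typing import Dict, Generator, List, Tuple

MetaTag = Tuple[int, int]  # assumed meta-tag: (row, col) of a sparse matrix


class CSR:
    """Hypothetical compressed-sparse-row matrix standing in for DSA data."""

    def __init__(self, row_ptr: List[int], col_idx: List[int], values: List[float]):
        self.row_ptr = row_ptr
        self.col_idx = col_idx
        self.values = values


def walk(csr: CSR, row: int, col: int) -> Generator[str, None, float]:
    """Coroutine-style walker: yields before each simulated memory access,
    so a controller could suspend it and service other misses meanwhile."""
    yield "read row_ptr"
    start, end = csr.row_ptr[row], csr.row_ptr[row + 1]
    for i in range(start, end):
        yield "read col_idx"
        if csr.col_idx[i] == col:
            yield "read value"
            return csr.values[i]
    return 0.0  # element is not stored, so it is an implicit zero


class MetaTagCache:
    """Cache indexed directly by the meta-tag instead of a global address."""

    def __init__(self, csr: CSR, capacity: int = 4):
        self.csr = csr
        self.capacity = capacity
        self.lines: Dict[MetaTag, float] = {}
        self.mem_accesses = 0  # counts walker steps, i.e. simulated memory reads

    def load(self, row: int, col: int) -> float:
        tag = (row, col)
        if tag in self.lines:
            return self.lines[tag]  # hit: no walk, no address translation
        walker = walk(self.csr, row, col)
        try:
            while True:
                next(walker)  # drive the walker one access at a time
                self.mem_accesses += 1
        except StopIteration as done:
            value = done.value
        if len(self.lines) >= self.capacity:
            # FIFO eviction: dicts preserve insertion order
            self.lines.pop(next(iter(self.lines)))
        self.lines[tag] = value
        return value


# 3x3 matrix [[5, 0, 0], [0, 0, 7], [0, 3, 0]] in CSR form
matrix = CSR(row_ptr=[0, 1, 2, 3], col_idx=[0, 2, 1], values=[5.0, 7.0, 3.0])
cache = MetaTagCache(matrix)
first = cache.load(1, 2)   # miss: the walker runs
second = cache.load(1, 2)  # hit: served from the meta-tag lookup
```

In this model a hit on (row, col) skips the `row_ptr` and `col_idx` reads entirely, which is the software analogue of avoiding the metadata walk and address translation that an address-tagged cache would repeat on every access.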
