X-Cache: a modular architecture for domain-specific caches

A. Sedaghati, Milad Hakimi, Reza Hojabr, Arrvindh Shriraman
Proceedings of the 49th Annual International Symposium on Computer Architecture. Published 2022-06-11. DOI: 10.1145/3470496.3527380 (https://doi.org/10.1145/3470496.3527380)
Citations: 1

Abstract

With Dennard scaling ending, architects are turning to domain-specific accelerators (DSAs). State-of-the-art DSAs work with sparse data [37] and indirectly indexed data structures [18, 30]. They introduce non-affine and dynamic memory accesses [7, 35] and therefore require domain-specific caches. Unfortunately, cache controllers are notoriously difficult to architect, and domain specialization compounds the problem. DSA caches need to support custom tags, data-structure walks, multiple refills, and preloading. Prior DSAs include ad-hoc cache structures and do not implement the cache controller. We propose X-Cache, a reusable caching idiom for DSAs. We will be open-sourcing a toolchain for both generating the RTL and programming X-Cache. There are three key ideas:
i) DSA-specific tags (Meta-tags): The designer can use any combination of fields from the DSA metadata as the tag. Meta-tags eliminate the overhead of walking the metadata and translating it to global addresses. This saves energy and improves load-to-use latency.
ii) DSA-programmable walkers (X-Actions): We find that a common set of microcode actions can be used to implement DSA-specific walking, data-block, and tag management. We develop a programmable microcode engine that can efficiently realize the data orchestration.
iii) DSA-portable controller (X-Routines): We use a portable abstraction, coroutines, to let the designer express walking and orchestration. Coroutines capture block-level parallelism, remain lightweight, and minimize controller occupancy.
We create caches for four different DSA families: Sparse GEMM [35, 37], GraphPulse [30], DASX [22], and Widx [18]. X-Cache outperforms address-based caches by 1.7× and remains competitive with hardwired DSAs (and even shows a 50% improvement in one case). We demonstrate that meta-tags save 26-79% energy compared to address-tags. In X-Cache, meta-tags consume 1.5-6.5% of data RAM energy, and the programmable microcode adds a further 7%.
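To make the first and third ideas concrete, the following is a minimal software sketch, not the authors' RTL or toolchain. It assumes a sparse matrix stored in CSR form and uses the row number as the meta-tag, so a hit returns the row's data block without touching the `row_ptr` metadata. The miss path is written as a Python generator standing in for a coroutine: each `yield` models one outstanding memory request, and the walker is suspended until the response is sent back. All names here (`MetaTagCache`, `walk_row`, the `memory` layout) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Generator, List, Tuple

Request = Tuple[str, int]        # (array name, index): one modeled memory request
Block = List[Tuple[int, float]]  # (column, value) pairs of one sparse row


def walk_row(row: int) -> Generator[Request, Any, Block]:
    """Hypothetical walker coroutine for one CSR row.

    Each yield issues a request and suspends; the controller resumes the
    walker with the response via send().
    """
    start = yield ("row_ptr", row)
    end = yield ("row_ptr", row + 1)
    block: Block = []
    for i in range(start, end):
        col = yield ("col_idx", i)
        val = yield ("values", i)
        block.append((col, val))
    return block


class MetaTagCache:
    """Toy cache whose tag is a DSA-level field (the row number), not an address."""

    def __init__(self, memory: Dict[str, list]) -> None:
        self.memory = memory                # assumed backing store: name -> array
        self.lines: Dict[int, Block] = {}   # meta-tag -> cached data block
        self.memory_reads = 0
        self.hits = 0

    def _run_walker(self, walker: Generator[Request, Any, Block]) -> Block:
        # Drive a single walker to completion, servicing each request in turn.
        try:
            request = next(walker)
            while True:
                array, index = request
                self.memory_reads += 1
                request = walker.send(self.memory[array][index])
        except StopIteration as done:
            return done.value

    def load(self, row: int) -> Block:
        if row in self.lines:               # hit: no metadata walk, no translation
            self.hits += 1
            return self.lines[row]
        block = self._run_walker(walk_row(row))
        self.lines[row] = block             # refill, indexed by the meta-tag
        return block


if __name__ == "__main__":
    memory = {
        "row_ptr": [0, 2, 3, 5],
        "col_idx": [0, 2, 1, 0, 2],
        "values": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
    cache = MetaTagCache(memory)
    print(cache.load(1), cache.memory_reads)  # miss: walks metadata and data
    print(cache.load(1), cache.memory_reads)  # hit: read count is unchanged
```

This sketch runs one walker at a time and has no eviction. A controller in the spirit of the abstract would keep several suspended walkers and resume whichever one has a response ready, which is where the block-level parallelism would come from.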