ARCHER: a ReRAM-based accelerator for compressed recommendation systems

Frontiers of Computer Science · Pub Date: 2023-12-23 · DOI: 10.1007/s11704-023-3397-x · Impact Factor 3.4 · JCR Q2 (Computer Science, Information Systems) · CAS Region 3 (Computer Science)
Xinyang Shen, Xiaofei Liao, Long Zheng, Yu Huang, Dan Chen, Hai Jin
{"title":"ARCHER: a ReRAM-based accelerator for compressed recommendation systems","authors":"Xinyang Shen, Xiaofei Liao, Long Zheng, Yu Huang, Dan Chen, Hai Jin","doi":"10.1007/s11704-023-3397-x","DOIUrl":null,"url":null,"abstract":"<p>Modern recommendation systems are widely used in modern data centers. The random and sparse embedding lookup operations are the main performance bottleneck for processing recommendation systems on traditional platforms as they induce abundant data movements between computing units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity limit of a monolithic ReRAM-based PIM chip, which induces off-chip accesses that may offset the PIM profits. Therefore, we deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the decompression performance loss. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendations under resource constraints. First, we make a full analysis of the computation pattern and access pattern on the decomposed table. Based on the computation pattern, we unify the operations of each layer of the decomposed model in multiply-and-accumulate operations. Based on the access observation, we propose a hierarchical mapping schema and a specialized hardware design to maximize resource utilization. Under the unified computation and mapping strategy, we can coordinate the inter-processing elements pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in terms of performance and 56.06×, 6.45×, and 1.71× in terms of energy savings, respectively.</p>","PeriodicalId":12640,"journal":{"name":"Frontiers of Computer Science","volume":"36 1","pages":""},"PeriodicalIF":3.4000,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers of Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11704-023-3397-x","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Cited by: 0

Abstract

Modern recommendation systems are widely deployed in data centers. Random, sparse embedding lookup operations are the main performance bottleneck when recommendation models run on traditional platforms, because they induce substantial data movement between compute units and memory. ReRAM-based processing-in-memory (PIM) can alleviate this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, and the resulting off-chip accesses may offset the benefits of PIM. Compressing the embedding table through decomposition shrinks the model enough to fit on-chip; we therefore deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the performance loss of decompression. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that performs recommendation inference fully on-chip under resource constraints. First, we analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of every layer of the decomposed model into multiply-and-accumulate (MAC) operations. Based on the access pattern, we propose a hierarchical mapping scheme and a specialized hardware design that maximize resource utilization. Under the unified computation and mapping strategy, the pipeline across processing elements can be coordinated. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in performance, and by 56.06×, 6.45×, and 1.71× in energy savings, respectively.
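Two points in the abstract benefit from a concrete illustration. First, the memory-bound nature of embedding lookups: each query gathers a handful of effectively random rows from a very large table and reduces them, so there is almost no arithmetic to hide the memory traffic. The NumPy sketch below is not from the paper; table sizes and names are illustrative only.

```python
import numpy as np

# Illustrative sizes only; production embedding tables can reach
# tens or hundreds of gigabytes.
NUM_ROWS, EMB_DIM = 1_000_000, 64
rng = np.random.default_rng(0)
table = rng.random((NUM_ROWS, EMB_DIM), dtype=np.float32)

def embedding_lookup(table, indices):
    """Gather sparse rows and pool them (sum reduction).

    Each lookup touches a few random rows of a huge table, so the
    operation performs almost no arithmetic per byte moved: on a
    CPU or GPU it is dominated by memory traffic, which is why
    processing the vectors where they are stored (PIM) helps."""
    return table[indices].sum(axis=0)

# One query hits ~30 effectively random rows.
query = rng.integers(0, NUM_ROWS, size=30)
pooled = embedding_lookup(table, query)   # shape (64,)
```

Second, the abstract says every layer of the decomposed model is unified into MAC operations, but it does not spell out the decomposition. A common choice for compressing embedding tables is a tensor-train (TT) factorization; the sketch below assumes that form purely for illustration, with hypothetical shapes and the helper `tt_row` invented here. Reconstructing one embedding row then becomes a short chain of small dense matrix multiplies, exactly the MAC pattern a ReRAM crossbar computes in place.

```python
import numpy as np

# Hypothetical TT shapes (not from the paper): 1,000,000 rows are
# factored as 100 x 100 x 100 and the 64-dim embedding as 4 x 4 x 4,
# with TT ranks (1, 8, 8, 1).
row_factors, dim_factors, ranks = (100, 100, 100), (4, 4, 4), (1, 8, 8, 1)

# Core k has shape (r_{k-1}, n_k, d_k, r_k); together the three cores
# hold ~32K parameters instead of the 64M of the dense table.
cores = [np.random.rand(ranks[k], row_factors[k], dim_factors[k], ranks[k + 1])
             .astype(np.float32)
         for k in range(3)]

def tt_row(cores, sub_indices):
    """Reconstruct one embedding row from TT cores.

    Every step is a small dense matrix multiply, so decompression is
    just a sequence of multiply-and-accumulate operations."""
    acc = cores[0][:, sub_indices[0], :, :]        # (1, d_0, r_1)
    acc = acc.reshape(-1, acc.shape[-1])           # (d_0, r_1)
    for core, idx in zip(cores[1:], sub_indices[1:]):
        g = core[:, idx, :, :]                     # (r_{k-1}, d_k, r_k)
        acc = acc @ g.reshape(g.shape[0], -1)      # contract the shared rank
        acc = acc.reshape(-1, g.shape[-1])         # (prod of d so far, r_k)
    return acc.reshape(-1)                         # embedding row, length 64

i = 123_456                                        # flat row index
sub = (i // 10_000, (i // 100) % 100, i % 100)     # map to TT sub-indices
row = tt_row(cores, sub)                           # shape (64,)
```

How ARCHER actually maps these MAC chains onto crossbars (the hierarchical mapping scheme and the pipeline across processing elements) is specific to the paper and not reproduced here.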

Source journal
Frontiers of Computer Science (Computer Science, Information Systems; Computer Science, Software Engineering)
CiteScore: 8.60
Self-citation rate: 2.40%
Articles per year: 799
Review time: 6-12 weeks
Journal description: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles across a wide range of topics, including architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, and interdisciplinary work. The journal especially encourages papers from emerging and multidisciplinary areas, papers reflecting international trends in research and development, and papers on special topics reporting progress made by Chinese computer scientists.
Latest articles in this journal
A comprehensive survey of federated transfer learning: challenges, methods and applications
DMFVAE: miRNA-disease associations prediction based on deep matrix factorization method with variational autoencoder
Graph foundation model
SEOE: an option graph based semantically embedding method for prenatal depression detection
FedTop: a constraint-loosed federated learning aggregation method against poisoning attack