HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao
{"title":"HEAT:一种高效且经济的cpu协同过滤推荐训练系统","authors":"Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao","doi":"10.1145/3577193.3593717","DOIUrl":null,"url":null,"abstract":"Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) It tiles the embedding matrix to increase data locality and reduce cache misses (thus reduces read latency); (2) It optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) It aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over existing CPU solution and 4.5× speedup and 7.9× cost reduction in Cloud over existing GPU solution with NVIDIA V100 GPU.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs\",\"authors\":\"Chengming Zhang, Shaden Smith, Baixi Sun, Jiannan Tian, Jon Soifer, Xiaodong Yu, S. Song, Yuxiong He, Dingwen Tao\",\"doi\":\"10.1145/3577193.3593717\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, there is no work that optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs. 
Specifically, the optimization of HEAT is threefold: (1) It tiles the embedding matrix to increase data locality and reduce cache misses (thus reduces read latency); (2) It optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) It aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over existing CPU solution and 4.5× speedup and 7.9× cost reduction in Cloud over existing GPU solution with NVIDIA V100 GPU.\",\"PeriodicalId\":424155,\"journal\":{\"name\":\"Proceedings of the 37th International Conference on Supercomputing\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577193.3593717\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593717","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation. Among all CF approaches, SimpleX is the state-of-the-art method that adopts a novel loss function and a proper number of negative samples. However, no prior work optimizes SimpleX on multi-core CPUs, leading to limited performance. To this end, we perform an in-depth profiling and analysis of existing SimpleX implementations and identify their performance bottlenecks, including (1) irregular memory accesses, (2) unnecessary memory copies, and (3) redundant computations. To address these issues, we propose an efficient CF training system (called HEAT) that fully exploits the multi-level caching and multi-threading capabilities of modern CPUs. Specifically, the optimization of HEAT is threefold: (1) it tiles the embedding matrix to increase data locality and reduce cache misses (thus reducing read latency); (2) it optimizes stochastic gradient descent (SGD) with sampling by parallelizing vector products instead of matrix-matrix multiplications, in particular the similarity computation therein, to avoid memory copies for matrix data preparation; and (3) it aggressively reuses intermediate results from the forward phase in the backward phase to alleviate redundant computation. Evaluation on five widely used datasets with both x86- and ARM-architecture processors shows that HEAT achieves up to 45.2× speedup over the existing CPU solution, and 4.5× speedup and 7.9× cost reduction in the cloud over the existing GPU solution on an NVIDIA V100 GPU.
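The page carries no source code, so the three optimizations can only be illustrated with short, hedged sketches. First, embedding-matrix tiling. The C++ sketch below is a minimal illustration, not HEAT's implementation: it assumes row-major embedding tables, plain dot-product similarity, and a hypothetical tile size TILE chosen so a tile of item rows fits in a core's L2 cache; the function name tiled_scores is invented for illustration. Keeping the tile loop outermost lets each tile of item embeddings be reused from cache across the whole user batch instead of being streamed from DRAM once per user.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile size: rows per tile, picked so that
// TILE * dim * sizeof(float) fits in a core's L2 cache.
constexpr std::size_t TILE = 256;

// Score a batch of users against all items. The item matrix is walked tile by
// tile in the OUTER loop, so each tile of item embeddings stays cache-resident
// while every user in the batch consumes it.
void tiled_scores(const std::vector<float>& items,  // row-major [n_items x dim]
                  const std::vector<float>& users,  // row-major [n_users x dim]
                  std::size_t n_items, std::size_t n_users, std::size_t dim,
                  std::vector<float>& out) {        // [n_users x n_items]
    out.assign(n_users * n_items, 0.0f);
    for (std::size_t t = 0; t < n_items; t += TILE) {
        const std::size_t end = std::min(t + TILE, n_items);
        for (std::size_t uidx = 0; uidx < n_users; ++uidx) {
            const float* u = &users[uidx * dim];
            for (std::size_t i = t; i < end; ++i) {
                const float* row = &items[i * dim];
                float s = 0.0f;
                for (std::size_t d = 0; d < dim; ++d) s += row[d] * u[d];
                out[uidx * n_items + i] = s;
            }
        }
    }
}
```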
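Second, SGD with sampled negatives via vector products. The sketch below is an illustrative stand-in rather than the paper's code: sgd_step is a hypothetical helper, and a squared-error loss replaces SimpleX's cosine contrastive loss to keep the gradients short. What it demonstrates is the abstract's point that similarities and updates can operate directly on the embedding rows in place, so no rows are gathered or copied into a temporary dense matrix for a matrix-matrix multiplication.

```cpp
#include <cstddef>
#include <vector>

// One SGD step for a (user, positive item, sampled negatives) triple.
// Similarities are computed as direct dot products on the embedding rows,
// with no intermediate matrix materialized. Loss: 0.5*(s-1)^2 for the
// positive, 0.5*s^2 for each negative (a stand-in, not SimpleX's loss).
void sgd_step(std::vector<float>& U, std::vector<float>& V,
              std::size_t user, std::size_t pos,
              const std::vector<std::size_t>& negs,
              std::size_t dim, float lr) {
    float* u = &U[user * dim];
    auto dot = [&](const float* a, const float* b) {
        float s = 0.0f;
        for (std::size_t d = 0; d < dim; ++d) s += a[d] * b[d];
        return s;
    };
    // Positive sample: push its score toward 1.
    float* vp = &V[pos * dim];
    float gp = dot(u, vp) - 1.0f;      // dLoss/dscore for the positive
    for (std::size_t d = 0; d < dim; ++d) {
        float ud = u[d];
        u[d]  -= lr * gp * vp[d];
        vp[d] -= lr * gp * ud;
    }
    // Sampled negatives: push their scores toward 0, row by row.
    for (std::size_t n : negs) {
        float* vn = &V[n * dim];
        float gn = dot(u, vn);         // dLoss/dscore for a negative
        for (std::size_t d = 0; d < dim; ++d) {
            float ud = u[d];
            u[d]  -= lr * gn * vn[d];
            vn[d] -= lr * gn * ud;
        }
    }
}
```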
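Third, reusing forward-phase intermediates in the backward phase. For a cosine similarity, the dot product and the two vector norms computed in the forward pass reappear in the gradient, so caching them avoids recomputation. The CosCache struct and function names below are hypothetical; the gradient is the standard derivative of cosine similarity with respect to u.

```cpp
#include <cmath>
#include <cstddef>

// Intermediates from the forward pass, kept for reuse in the backward pass.
struct CosCache { float dot, nu, nv, cos; };

// Forward: cos(u, v) = (u . v) / (||u|| * ||v||); store all intermediates.
CosCache cosine_forward(const float* u, const float* v, std::size_t dim) {
    float dot = 0.0f, uu = 0.0f, vv = 0.0f;
    for (std::size_t d = 0; d < dim; ++d) {
        dot += u[d] * v[d];
        uu  += u[d] * u[d];
        vv  += v[d] * v[d];
    }
    CosCache c;
    c.nu = std::sqrt(uu);
    c.nv = std::sqrt(vv);
    c.dot = dot;
    c.cos = dot / (c.nu * c.nv);
    return c;
}

// Backward: d(cos)/du_d = v_d / (nu*nv) - cos * u_d / nu^2.
// Reuses the cached dot product, norms, and cosine instead of recomputing them.
void cosine_backward_u(const CosCache& c, const float* u, const float* v,
                       float grad_out, std::size_t dim, float* grad_u) {
    const float inv    = 1.0f / (c.nu * c.nv);
    const float inv_uu = 1.0f / (c.nu * c.nu);
    for (std::size_t d = 0; d < dim; ++d)
        grad_u[d] = grad_out * (v[d] * inv - c.cos * u[d] * inv_uu);
}
```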