Merlin HugeCTR: GPU-accelerated Recommender System Training and Inference

Proceedings of the 16th ACM Conference on Recommender Systems. Pub Date: 2022-09-18. DOI: 10.1145/3523227.3547405

Zehuan Wang, Yingcan Wei, Minseok Lee, Matthias Langer, F. Yu, Jie Liu, Shijie Liu, Daniel G. Abel, Xu Guo, Jianbing Dong, Ji Shi, Kunlun Li


Citations: 12

Abstract

In this talk, we introduce Merlin HugeCTR, an open-source, GPU-accelerated integration framework for click-through rate estimation. It optimizes both training and inference, and enables model training at scale with model-parallel embeddings and data-parallel neural networks. In particular, Merlin HugeCTR combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. In the MLPerf v1.0 DLRM model training benchmark, Merlin HugeCTR achieves a speedup of up to 24.6x on a single DGX A100 (8x A100) over PyTorch on 4x4-socket CPU nodes (4x4x28 cores). Merlin HugeCTR can also take advantage of multi-node environments to accelerate training even further. Since late 2021, Merlin HugeCTR additionally features a hierarchical parameter server (HPS) and supports deployment via the NVIDIA Triton server framework, leveraging the computational capabilities of GPUs for high-speed recommendation model inference. Using this HPS, Merlin HugeCTR users can achieve a 5x to 62x speedup (depending on batch size) for popular recommendation models over CPU baseline implementations, and dramatically reduce their end-to-end inference latency.
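
The abstract does not describe how the embedding cache and the hierarchical storage interact. As a rough illustration of the general idea only, the following toy sketch puts a small, fast cache in front of a larger, slower table and serves lookups from the cache whenever possible. It is not HugeCTR's API: the class name TieredEmbeddingStore, its methods, and the LRU eviction policy are all assumptions made for this example, and plain Python dictionaries stand in for GPU memory and for the slower storage tiers.

```python
from collections import OrderedDict


class TieredEmbeddingStore:
    """Toy two-tier embedding lookup (hypothetical, not the HugeCTR API).

    `backing` stands in for a large, slower storage tier;
    `cache` stands in for a small, fast embedding cache.
    """

    def __init__(self, backing, cache_capacity):
        self.backing = backing              # dict: embedding id -> vector
        self.cache = OrderedDict()          # insertion order tracks recency
        self.cache_capacity = cache_capacity
        self.hits = 0
        self.misses = 0

    def lookup(self, key):
        # Fast path: the embedding is already cached.
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.cache[key]

        # Slow path: fetch from the backing tier, then cache it.
        self.misses += 1
        vector = self.backing[key]
        self.cache[key] = vector
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict least recently used (assumed policy)
        return vector

    def lookup_batch(self, keys):
        # An inference request typically needs many embeddings at once.
        return [self.lookup(k) for k in keys]


if __name__ == "__main__":
    table = {i: [float(i), float(i) + 0.5] for i in range(10)}
    store = TieredEmbeddingStore(table, cache_capacity=2)
    vectors = store.lookup_batch([1, 2, 1, 3, 2])
    print(vectors)
    print("hits:", store.hits, "misses:", store.misses)
```

In a sketch like this, the share of lookups served from the cache depends on how skewed the access pattern is and on the cache capacity; the abstract's remark that the speedup depends on batch size is a separate effect that this toy does not model.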


