A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models

Yingcan Wei, Matthias Langer, F. Yu, Minseok Lee, Jie Liu, Ji Shi, Zehuan Wang
{"title":"A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models","authors":"Yingcan Wei, Matthias Langer, F. Yu, Minseok Lee, Jie Liu, Ji Shi, Zehuan Wang","doi":"10.1145/3523227.3546765","DOIUrl":null,"url":null,"abstract":"Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework, that combines a high-performance GPU embedding cache with an hierarchical storage architecture, to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support and (4) light-weight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that our HPS can dramatically reduce end-to-end inference latency, achieving 5~62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, the HPS can also greatly increase the inference QPS.","PeriodicalId":443279,"journal":{"name":"Proceedings of the 16th ACM Conference on Recommender Systems","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 16th ACM Conference on Recommender Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3523227.3546765","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Recommendation systems are of crucial importance for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, and search. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale. In this paper, we provide insights into the intriguing and challenging inference domain of online recommendation systems. We propose the HugeCTR Hierarchical Parameter Server (HPS), an industry-leading distributed recommendation inference framework that combines a high-performance GPU embedding cache with a hierarchical storage architecture to realize low-latency retrieval of embeddings for online model inference tasks. Among other things, HPS features (1) a redundant hierarchical storage system, (2) a novel high-bandwidth cache to accelerate parallel embedding lookup on NVIDIA GPUs, (3) online training support, and (4) lightweight APIs for easy integration into existing large-scale recommendation workflows. To demonstrate its capabilities, we conduct extensive studies using both synthetically engineered and public datasets. We show that our HPS can dramatically reduce end-to-end inference latency, achieving a 5–62x speedup (depending on the batch size) over CPU baseline implementations for popular recommendation models. Through multi-GPU concurrent deployment, HPS can also greatly increase the inference QPS.
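
The core mechanism described above, a small and fast GPU embedding cache backed by progressively larger and slower storage tiers, can be illustrated with a short, framework-agnostic sketch. The class and method names below (EmbeddingTier, HierarchicalLookup) are illustrative assumptions rather than the actual HugeCTR HPS API, and an in-process LRU dictionary stands in for the GPU embedding cache.

```python
# Minimal sketch of a hierarchical embedding lookup: a bounded LRU "device cache"
# queried first, with misses falling back to slower storage tiers. Names and
# structure are hypothetical, not the HugeCTR HPS interface.
from collections import OrderedDict
import numpy as np


class EmbeddingTier:
    """One storage tier mapping embedding keys to vectors (hypothetical)."""

    def __init__(self, table: dict[int, np.ndarray]):
        self.table = table

    def lookup(self, key: int) -> np.ndarray | None:
        return self.table.get(key)


class HierarchicalLookup:
    """Query a bounded LRU cache first, then walk progressively slower tiers."""

    def __init__(self, tiers: list[EmbeddingTier], cache_capacity: int = 1024):
        self.tiers = tiers                   # e.g. [host memory, SSD / remote store]
        self.cache = OrderedDict()           # stands in for the GPU embedding cache
        self.cache_capacity = cache_capacity

    def lookup(self, keys: list[int], dim: int = 16) -> np.ndarray:
        out = np.zeros((len(keys), dim), dtype=np.float32)
        for i, key in enumerate(keys):
            vec = self._lookup_one(key)
            if vec is not None:
                out[i] = vec                 # missing keys keep a zero default vector
        return out

    def _lookup_one(self, key: int) -> np.ndarray | None:
        if key in self.cache:                # cache hit: refresh LRU position
            self.cache.move_to_end(key)
            return self.cache[key]
        for tier in self.tiers:              # cache miss: walk the slower tiers in order
            vec = tier.lookup(key)
            if vec is not None:
                self._insert(key, vec)       # promote the hot embedding into the cache
                return vec
        return None

    def _insert(self, key: int, vec: np.ndarray) -> None:
        self.cache[key] = vec
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)   # evict the least-recently-used entry


# Example: two tiers standing in for host memory and a slower backing store.
host = EmbeddingTier({1: np.ones(16, dtype=np.float32)})
backing = EmbeddingTier({2: np.full(16, 2.0, dtype=np.float32)})
hps_like = HierarchicalLookup([host, backing], cache_capacity=2)
vectors = hps_like.lookup([1, 2, 3])         # key 3 is absent and falls back to zeros
```

The key behavior in this sketch is that a cache miss walks the tiers in order and promotes the retrieved vector into the cache, so frequently accessed embeddings are served from the fastest tier on subsequent lookups.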