EmbedX: A Versatile, Efficient and Scalable Platform to Embed Both Graphs and High-Dimensional Sparse Data

Proceedings of the VLDB Endowment; Pub Date: 2023-08-01; DOI: 10.14778/3611540.3611546; IF 3.3; Zone 3 (Computer Science); JCR Q2 (Computer Science, Information Systems)

Yuanhang Zou, Zhihao Ding, Jieming Shi, Shuting Guo, Chunchen Su, Yafei Zhang



Abstract

In modern online services, it is increasingly important to process web-scale graph data and high-dimensional sparse data together into embeddings for downstream tasks such as recommendation, advertising, prediction, and classification. Existing learning methods and systems handle either high-dimensional sparse data or graphs, but not both. Industry urgently needs a system that efficiently processes both types of data for higher business value, but building one is challenging. Tencent's data contains billions of samples with sparse features in very high dimensions, and its graphs also have billions of nodes and edges. Moreover, learning models often perform computationally expensive operations. Massive sparse data and graph data are difficult to store, manage, and retrieve together because they exhibit different characteristics. We present EmbedX, an industrial distributed learning framework from Tencent that is versatile and efficient in supporting embedding on both graphs and high-dimensional sparse data. EmbedX consists of distributed server layers for graph and sparse data management, together with optimized parameter and graph operators, to efficiently support four categories of methods: deep learning models on high-dimensional sparse data, network embedding methods, graph neural networks, and in-house joint learning models on both types of data. Extensive experiments on massive Tencent data and public data demonstrate the superiority of EmbedX. For instance, on a Tencent dataset with 1.3 billion nodes, 35 billion edges, and 2.8 billion samples with sparse features in 1.6 billion dimensions, EmbedX trains an order of magnitude faster and our joint models achieve superior effectiveness. EmbedX is deployed at Tencent, and A/B tests on real use cases further validate its power. EmbedX is implemented in C++ and open-sourced at https://github.com/Tencent/embedx.
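To make the idea of embedding graph data and sparse features together more concrete, the following is a minimal single-process sketch in Python. It is not EmbedX's API and not the authors' joint model, which the abstract does not specify. The class names `ShardedSparseTable` and `ShardedGraph`, the function `joint_embedding`, the modulo sharding rule, and the mean-pooling combination are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict

DIM = 4  # embedding width, chosen arbitrarily for this sketch


class ShardedSparseTable:
    """Hypothetical stand-in for a distributed sparse parameter store."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def lookup(self, key):
        # Route the key to a shard and create its row lazily, so memory grows
        # with the keys actually seen rather than with the full ID space.
        shard = self.shards[key % len(self.shards)]
        if key not in shard:
            rng = random.Random(key)  # deterministic per-key initialization
            shard[key] = [rng.uniform(-0.1, 0.1) for _ in range(DIM)]
        return shard[key]


class ShardedGraph:
    """Hypothetical stand-in for a distributed adjacency store."""

    def __init__(self, num_shards):
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def add_edge(self, src, dst):
        self.shards[src % len(self.shards)][src].append(dst)

    def neighbors(self, node):
        # .get avoids creating empty entries for unknown nodes.
        return self.shards[node % len(self.shards)].get(node, [])


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * DIM
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def joint_embedding(node, feature_ids, graph, node_table, feature_table):
    # Graph side: average the embeddings of the node and its out-neighbors.
    members = [node] + graph.neighbors(node)
    graph_part = mean_pool([node_table.lookup(n) for n in members])
    # Sparse side: average the embeddings of the sample's active features.
    sparse_part = mean_pool([feature_table.lookup(f) for f in feature_ids])
    # Concatenate both views into one vector of length 2 * DIM.
    return graph_part + sparse_part


graph = ShardedGraph(num_shards=4)
graph.add_edge(1, 2)
graph.add_edge(1, 7)
node_table = ShardedSparseTable(num_shards=4)
feature_table = ShardedSparseTable(num_shards=4)

# Feature IDs may come from a huge space; only the two used here get rows.
vec = joint_embedding(1, [10**9 + 3, 42], graph, node_table, feature_table)
print(len(vec))
```

The lazily populated tables reflect a common way to handle very high-dimensional sparse features: only IDs that actually occur consume memory. A real system would place shards on separate servers and learn the embeddings through training rather than leaving them at their initial values.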


Source journal

Proceedings of the VLDB Endowment (Computer Science: General Computer Science)

CiteScore: 7.70

Self-citation rate: 0.00%

Journal description: The Proceedings of the VLDB Endowment (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.
