PinText: A Multitask Text Embedding System in Pinterest

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
Pub Date: 2019-07-25
DOI: 10.1145/3292500.3330671

Jinfeng Zhuang, Yu Liu


Citations: 13

Abstract

Text embedding is a fundamental component for extracting text features in production-level data mining and machine learning systems, given that textual information is the most ubiquitous signal. However, practitioners often face a tradeoff between the effectiveness of the underlying embedding algorithms and the cost of training and maintaining various embedding results in large-scale applications. In this paper, we propose a multitask text embedding solution called PinText for three major vertical surfaces in Pinterest, including homefeed, related pins, and search, which consolidates existing text embedding algorithms into a single solution and produces state-of-the-art performance. Specifically, we learn word-level semantic vectors by enforcing that the similarity between positive engagement pairs is larger than the similarity between randomly sampled background pairs. Based on the learned semantic vectors, we derive the embedding vector of a user, a pin, or a search query by simply averaging its word-level vectors. In this common compact vector space, we are able to do unified nearest neighbor search with hashing via Hadoop jobs or dockerized images on a Kubernetes cluster. Both offline evaluation and online experiments show the effectiveness of the PinText system, which also significantly reduces the storage cost of multiple open-sourced embeddings.
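The three ideas in the abstract (averaging word vectors, ranking positive pairs above background pairs, and hashing for nearest neighbor search) can be illustrated with a small sketch. This is not the authors' implementation: the cosine similarity, the hinge form of the loss, the `margin` value, the sign-random-projection hash, the toy vocabulary, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import random


def embed(text, word_vectors):
    """Embed an entity (user, pin, or query) as the average of its word vectors."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        raise ValueError("no known words in text")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity (an assumed choice of similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ranking_loss(anchor, positive, background, word_vectors, margin=0.5):
    """Hinge loss: zero once sim(anchor, positive) beats sim(anchor, background) by `margin`."""
    a = embed(anchor, word_vectors)
    sim_pos = cosine(a, embed(positive, word_vectors))
    sim_neg = cosine(a, embed(background, word_vectors))
    return max(0.0, margin - sim_pos + sim_neg)


def lsh_signature(vec, hyperplanes):
    """Sign-random-projection hash: one bit per hyperplane (an assumed hashing scheme)."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


# Hypothetical 2-d word vectors; real vectors would be learned by minimizing the loss.
word_vectors = {
    "red": [1.0, 0.0],
    "dress": [0.8, 0.2],
    "summer": [0.6, 0.4],
    "garage": [0.0, 1.0],
    "tools": [0.1, 0.9],
}

# An engaged (positive) pair should incur no loss; a swapped pair should be penalized.
loss_good = ranking_loss("red dress", "summer dress", "garage tools", word_vectors)
loss_bad = ranking_loss("red dress", "garage tools", "summer dress", word_vectors)

# Entities whose signatures collide become nearest-neighbor candidates.
rng = random.Random(0)
hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(8)]
signature = lsh_signature(embed("red dress", word_vectors), hyperplanes)
```

In a setup like this, training would adjust the entries of `word_vectors` to drive the loss toward zero over many engagement pairs, and because users, pins, and queries share one vector space, a single hash index can serve all three surfaces.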


