PinText: A Multitask Text Embedding System in Pinterest

Jinfeng Zhuang, Yu Liu
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Published 2019-07-25. DOI: 10.1145/3292500.3330671 (https://doi.org/10.1145/3292500.3330671)
Citations: 13

Abstract

Text embedding is a fundamental component for extracting text features in production-level data mining and machine learning systems, given that textual information is the most ubiquitous signal. However, practitioners often face a tradeoff between the effectiveness of the underlying embedding algorithms and the cost of training and maintaining multiple sets of embeddings in large-scale applications. In this paper, we propose a multitask text embedding solution called PinText for three major vertical surfaces in Pinterest: homefeed, related pins, and search. PinText consolidates existing text embedding algorithms into a single solution and produces state-of-the-art performance. Specifically, we learn word-level semantic vectors by enforcing that the similarity between positive engagement pairs is larger than the similarity between randomly sampled background pairs. Based on the learned semantic vectors, we derive the embedding vector of a user, a pin, or a search query by simply averaging its word-level vectors. In this common compact vector space, we are able to do unified nearest neighbor search with hashing, run as Hadoop jobs or as dockerized images on a Kubernetes cluster. Both offline evaluation and online experiments demonstrate the effectiveness of the PinText system and show that it significantly reduces the storage cost of maintaining multiple open-source embeddings.
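
The three steps named in the abstract (averaging word-level vectors, ranking a positive engagement pair above a background pair, and hashing for nearest neighbor search) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: cosine similarity as the similarity measure, a hinge loss with margin 0.1, sign-of-projection hashing as the hashing scheme, and the toy two-dimensional vectors and function names.

```python
import math

def embed(text, word_vectors, dim):
    """Average the vectors of the known words in text (zero vector if none)."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def ranking_loss(query, positive, background, word_vectors, dim, margin=0.1):
    """Hinge loss (an assumed form): zero once the positive pair is more
    similar than the background pair by at least `margin`."""
    q = embed(query, word_vectors, dim)
    pos = cosine(q, embed(positive, word_vectors, dim))
    neg = cosine(q, embed(background, word_vectors, dim))
    return max(0.0, margin - pos + neg)

def sign_hash(vec, hyperplanes):
    """One bit per hyperplane: 1 if the projection is non-negative, else 0.
    Vectors that share a code are candidate nearest neighbors."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )

# Toy, hand-written word vectors (hypothetical, not learned).
WORD_VECTORS = {"red": [1.0, 0.0], "dress": [0.8, 0.2], "car": [0.0, 1.0]}
DIM = 2

query_vec = embed("red dress", WORD_VECTORS, DIM)
good = ranking_loss("red dress", "dress", "car", WORD_VECTORS, DIM)
bad = ranking_loss("red dress", "car", "dress", WORD_VECTORS, DIM)
code = sign_hash(query_vec, [[1.0, 0.0], [0.0, -1.0]])
```

In this toy setup the correctly ordered pair incurs zero loss while the swapped pair is penalized, which is the constraint that training would push the word vectors to satisfy. Because users, pins, and queries are all embedded by the same averaging step, a single hash index could serve all three surfaces.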