Learning Global Term Weights for Content-based Recommender Systems

Yupeng Gu, Bo Zhao, D. Hardtke, Yizhou Sun
{"title":"Learning Global Term Weights for Content-based Recommender Systems","authors":"Yupeng Gu, Bo Zhao, D. Hardtke, Yizhou Sun","doi":"10.1145/2872427.2883069","DOIUrl":null,"url":null,"abstract":"Recommender systems typically leverage two types of signals to effectively recommend items to users: user activities and content matching between user and item profiles, and recommendation models in literature are usually categorized into collaborative filtering models, content-based models and hybrid models. In practice, when rich profiles about users and items are available, and user activities are sparse (cold-start), effective content matching signals become much more important in the relevance of the recommendation. The de-facto method to measure similarity between two pieces of text is computing the cosine similarity of the two bags of words, and each word is weighted by TF (term frequency within the document) x IDF (inverted document frequency of the word within the corpus). In general sense, TF can represent any local weighting scheme of the word within each document, and IDF can represent any global weighting scheme of the word across the corpus. In this paper, we focus on the latter, i.e., optimizing the global term weights, for a particular recommendation domain by leveraging supervised approaches. The intuition is that some frequent words (lower IDF, e.g. ``database'') can be essential and predictive for relevant recommendation, while some rare words (higher IDF, e.g. the name of a small company) could have less predictive power. Given plenty of observed activities between users and items as training data, we should be able to learn better domain-specific global term weights, which can further improve the relevance of recommendation. We propose a unified method that can simultaneously learn the weights of multiple content matching signals, as well as global term weights for specific recommendation tasks. Our method is efficient to handle large-scale training data generated by production recommender systems. And experiments on LinkedIn job recommendation data justify the effectiveness of our approach.","PeriodicalId":20455,"journal":{"name":"Proceedings of the 25th International Conference on World Wide Web","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"40","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 25th International Conference on World Wide Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872427.2883069","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 40

Abstract

Recommender systems typically leverage two types of signals to effectively recommend items to users: user activities and content matching between user and item profiles. Recommendation models in the literature are usually categorized into collaborative filtering models, content-based models and hybrid models. In practice, when rich profiles about users and items are available and user activities are sparse (cold start), effective content matching signals become much more important to the relevance of the recommendation. The de facto method to measure similarity between two pieces of text is to compute the cosine similarity of the two bags of words, where each word is weighted by TF (term frequency within the document) × IDF (inverse document frequency of the word within the corpus). In a general sense, TF can represent any local weighting scheme of the word within each document, and IDF can represent any global weighting scheme of the word across the corpus. In this paper, we focus on the latter, i.e., optimizing the global term weights for a particular recommendation domain by leveraging supervised approaches. The intuition is that some frequent words (lower IDF, e.g. "database") can be essential and predictive for relevant recommendation, while some rare words (higher IDF, e.g. the name of a small company) can have less predictive power. Given plenty of observed activities between users and items as training data, we should be able to learn better domain-specific global term weights, which can further improve the relevance of recommendation. We propose a unified method that simultaneously learns the weights of multiple content matching signals as well as the global term weights for a specific recommendation task. Our method efficiently handles the large-scale training data generated by production recommender systems, and experiments on LinkedIn job recommendation data demonstrate the effectiveness of our approach.
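
To make the idea concrete, the sketch below (not the authors' implementation) shows how supervised global term weights could be learned from observed user-item interactions. It uses plain bag-of-words TF vectors for user and item profiles, scores a pair with a weighted dot product (a simplification of the TF × IDF cosine similarity described above, chosen so the gradient stays exact), and fits non-negative per-term weights with a logistic loss. All function names, the tiny vocabulary, and the click labels are illustrative assumptions, not part of the paper.

```python
# Minimal sketch: learn per-term global weights from user-item feedback,
# in place of the fixed, unsupervised IDF weight. Illustrative only.
import numpy as np
from collections import Counter


def build_tf(docs, vocab):
    """Bag-of-words term-frequency matrix: one row per document."""
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for term, count in Counter(doc.lower().split()).items():
            if term in index:
                tf[i, index[term]] = count
    return tf


def match_score(u_tf, i_tf, w):
    """Content-matching score: TF vectors of a user and an item profile,
    each term scaled by the shared global weight w[t] (IDF would be the
    classic unsupervised choice for w)."""
    return float(np.sum(w * u_tf * i_tf))


def learn_global_weights(user_tf, item_tf, pairs, labels, epochs=300, lr=0.1):
    """Fit non-negative per-term global weights with a logistic loss on
    observed (user, item) pairs, using plain batch gradient descent."""
    w = np.ones(user_tf.shape[1])                # start from uniform weights
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for (ui, ii), y in zip(pairs, labels):
            s = match_score(user_tf[ui], item_tf[ii], w)
            p = 1.0 / (1.0 + np.exp(-s))         # predicted click probability
            grad += (p - y) * user_tf[ui] * item_tf[ii]
        w = np.maximum(w - lr * grad / len(pairs), 0.0)  # keep weights >= 0
    return w


if __name__ == "__main__":
    # Tiny hypothetical job-recommendation example: "database" is frequent
    # (low IDF) yet predictive; "acmecorp" is rare (high IDF) yet uninformative.
    vocab = ["database", "sql", "acmecorp"]
    user_docs = ["database sql database", "sql acmecorp"]
    item_docs = ["database sql engineer", "acmecorp office assistant"]
    u_tf, i_tf = build_tf(user_docs, vocab), build_tf(item_docs, vocab)
    pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [1, 0, 1, 0]                        # hypothetical click feedback
    w = learn_global_weights(u_tf, i_tf, pairs, labels)
    print(dict(zip(vocab, np.round(w, 3))))
```

On this toy data the learned weights for the frequent but predictive terms "database" and "sql" should stay high while the rare, uninformative "acmecorp" is driven toward zero, mirroring the intuition above; the paper's actual model additionally learns the weights of multiple content matching signals jointly.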
Latest articles from this venue
MapWatch: Detecting and Monitoring International Border Personalization on Online Maps
Automatic Discovery of Attribute Synonyms Using Query Logs and Table Corpora
Learning Global Term Weights for Content-based Recommender Systems
From Freebase to Wikidata: The Great Migration
GoCAD: GPU-Assisted Online Content-Adaptive Display Power Saving for Mobile Devices in Internet Streaming