将流行度纳入社会网络分析的主题模型

Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho
{"title":"将流行度纳入社会网络分析的主题模型","authors":"Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho","doi":"10.1145/2484028.2484086","DOIUrl":null,"url":null,"abstract":"Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in \"irrelevant\" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this \"popularity problem\", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a \"popularity component\" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.","PeriodicalId":178818,"journal":{"name":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"41","resultStr":"{\"title\":\"Incorporating popularity in topic models for social network analysis\",\"authors\":\"Youngchul Cha, Bin Bi, Chu-Cheng Hsieh, Junghoo Cho\",\"doi\":\"10.1145/2484028.2484086\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in \\\"irrelevant\\\" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this \\\"popularity problem\\\", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a \\\"popularity component\\\" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.\",\"PeriodicalId\":178818,\"journal\":{\"name\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-07-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"41\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2484028.2484086\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2484028.2484086","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 41

摘要

主题模型用于将文本数据集中的单词分组为一组相关主题。不幸的是,当一些单词频繁出现在数据集中时,由主题模型识别的主题组变得嘈杂,因为这些频繁出现的单词反复出现在“不相关”的主题组中。这种噪声在文本数据集中并不是一个严重的问题,因为频繁的单词(例如,and is)没有太多的意义,在主题模型分析之前已经被简单地删除了。然而,在我们感兴趣的社交网络数据集中,他们对应于受欢迎的人(例如,巴拉克·奥巴马和贾斯汀·比伯),不能简单地删除,因为大多数人对他们感兴趣。为了解决这个“流行度问题”,我们在主题模型中显式地对节点(词)的流行度进行建模。为此,我们首先引入“流行度组件”的概念,并提出有效容纳流行度组件的主题模型扩展。我们用真实的Twitter数据集来评估模型的有效性。与最先进的基线相比,我们提出的模型显着降低了困惑度(即更好的预测能力)。除了高入边度节点的受欢迎程度问题外,我们还通过另一个主题模型扩展研究了出边度的影响。结果表明,考虑向外边缘度对实现较低的困惑度没有太大帮助。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Incorporating popularity in topic models for social network analysis
Topic models are used to group words in a text dataset into a set of relevant topics. Unfortunately, when a few words frequently appear in a dataset, the topic groups identified by topic models become noisy because these frequent words repeatedly appear in "irrelevant" topic groups. This noise has not been a serious problem in a text dataset because the frequent words (e.g., the and is) do not have much meaning and have been simply removed before a topic model analysis. However, in a social network dataset we are interested in, they correspond to popular persons (e.g., Barack Obama and Justin Bieber) and cannot be simply removed because most people are interested in them. To solve this "popularity problem", we explicitly model the popularity of nodes (words) in topic models. For this purpose, we first introduce a notion of a "popularity component" and propose topic model extensions that effectively accommodate the popularity component. We evaluate the effectiveness of our models with a real-world Twitter dataset. Our proposed models achieve significantly lower perplexity (i.e., better prediction power) compared to the state-of-the-art baselines. In addition to the popularity problem caused by the nodes with high incoming edge degree, we also investigate the effect of the outgoing edge degree with another topic model extensions. We show that considering outgoing edge degree does not help much in achieving lower perplexity.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Search engine switching detection based on user personal preferences and behavior patterns Workshop on benchmarking adaptive retrieval and recommender systems: BARS 2013 A test collection for entity search in DBpedia Sentiment analysis of user comments for one-class collaborative filtering over ted talks A document rating system for preference judgements
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1