CitationLDA++: an Extension of LDA for Discovering Topics in Document Network

T. Nguyen, P. Do
Proceedings of the 9th International Symposium on Information and Communication Technology, 2018-12-06
DOI: 10.1145/3287921.3287930
Citations: 5

Abstract

With the rapid development of electronic repositories of scientific publications, automatic topic identification from papers has become a great help to researchers. The Latent Dirichlet Allocation (LDA) model is the most popular method for discovering hidden topics in texts based on the co-occurrence of words in a corpus. The LDA algorithm achieves good results on long documents. However, article repositories usually store only the title and abstract, which are too short for the LDA algorithm to work effectively. In this paper, we propose the CitationLDA++ model, which improves the performance of the LDA algorithm in inferring the topics of papers from the title and/or abstract together with citation information. The proposed model is based on the assumption that the topics of the cited papers also reflect the topics of the citing paper. In this study, we divide the dataset into two sets: the first is used to build a prior knowledge source with the LDA algorithm, and the second is the training dataset used by CitationLDA++. During inference with Gibbs sampling, the CitationLDA++ algorithm uses the topic distributions of the prior knowledge source and the citation information to guide the assignment of topics to the words in a text. Using the topics of cited papers helps overcome the limited word co-occurrence in linked short texts. In experiments on the AMiner dataset, which includes the titles and/or abstracts of papers along with citation information, the CitationLDA++ algorithm achieves better perplexity than LDA without the additional knowledge. The experimental results suggest that citation information can improve the ability of the LDA algorithm to discover the topics of papers when their full content is not available.
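To make the idea concrete, the following is a minimal sketch (not the authors' exact implementation) of collapsed Gibbs sampling for LDA in which the per-document topic prior is biased toward the topic distributions of the papers a document cites, in the spirit of CitationLDA++. All names and parameters here (`K`, `alpha`, `beta`, `lam`, `prior_theta`, `cite_mix`) are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch, assuming the citation guidance enters as a blended
# Dirichlet-like prior on the document-topic distribution. Not the paper's
# exact sampler; parameter names and the blending weight lam are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def citation_lda_gibbs(docs, citations, prior_theta, V, K=10,
                       alpha=0.1, beta=0.01, lam=0.5, iters=50):
    """docs: list of word-id lists; citations[d]: indices into prior_theta
    (rows are topic distributions of cited papers from the prior LDA run)."""
    D = len(docs)
    ndk = np.zeros((D, K))   # document-topic counts
    nkw = np.zeros((K, V))   # topic-word counts
    nk = np.zeros(K)         # per-topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    # Average topic mix of each document's cited papers (uniform if none).
    cite_mix = np.array([prior_theta[c].mean(axis=0) if len(c)
                         else np.full(K, 1.0 / K) for c in citations])
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Citation-guided prior: blend the symmetric alpha with the
                # cited papers' topic mix to steer short documents.
                prior = (1 - lam) * alpha + lam * cite_mix[d]
                p = (ndk[d] + prior) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return theta, phi
```

The key departure from plain LDA is the `prior` term: for a short document, the count vector `ndk[d]` carries little signal, so the blended citation prior dominates and pulls the sample toward the topics of the cited papers, which is the effect the abstract describes.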