Phrase Based Topic Modeling for Semantic Information Processing in Biomedicine.

Proceedings of the ... International Conference on Machine Learning and Applications. Pub Date: 2013-12-01. Epub Date: 2014-04-10. DOI: 10.1109/ICMLA.2013.89

Zhiguo Yu, Todd R Johnson, Ramakanth Kavuluru

Volume 2013, pp. 440-445.

Citations: 12

Abstract

Given that unstructured data is growing exponentially every day, extracting and understanding the information, themes, and relationships in large document collections is increasingly important to researchers in many disciplines, including biomedicine. Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling technique based on the "bag-of-words" assumption that has been applied extensively to unveil hidden semantic themes within large sets of textual documents. Recently, it was extended using the "bag-of-n-grams" paradigm to account for word order. In this paper, we present an alternative phrase-based LDA model that moves from a bag-of-words or bag-of-n-grams paradigm to a "bag-of-key-phrases" setting by applying a key phrase extraction technique, the C-value method, to further explore latent themes. We evaluate our approach with a phrase intrusion user study and demonstrate that our model can help LDA generate better and more interpretable topics than those generated using the bag-of-n-grams approach. Because topic models are essentially statistical tools, an important problem in topic modeling is visualizing and interacting with the models to understand and extract new information from a collection. To evaluate our phrase-based modeling approach in this context, we incorporate it into an open-source interactive topic browser. Qualitative evaluations of this browser with biomedical experts demonstrate that our approach can help biomedical researchers gain a better and faster understanding of their document collections.
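The abstract names the C-value method as the key phrase extraction step but does not spell it out. As a rough illustration only, the sketch below scores multi-word candidates using the commonly cited form of C-value: a candidate's frequency is weighted by the base-2 log of its length, and a candidate nested inside longer candidates is penalized by the mean frequency of those longer candidates. This is an assumption about the general technique, not the authors' implementation; the function names `c_value` and `is_nested`, the tuple-of-tokens representation, and the example frequencies are all hypothetical.

```python
import math


def is_nested(short, long):
    """Return True if `short` occurs as a contiguous token run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(candidates):
    """Score candidate phrases with a simplified C-value measure.

    candidates: dict mapping a phrase (tuple of tokens) to its corpus frequency.
    Returns a dict mapping each phrase to its score. Single-word candidates
    score 0 because log2(1) == 0; the measure targets multi-word terms.
    """
    scores = {}
    for phrase, freq in candidates.items():
        # Longer candidates that contain this phrase as a nested term.
        containers = [
            other for other in candidates
            if len(other) > len(phrase) and is_nested(phrase, other)
        ]
        length_weight = math.log2(len(phrase))
        if not containers:
            scores[phrase] = length_weight * freq
        else:
            # Subtract the mean frequency of the longer containing candidates.
            mean_container_freq = (
                sum(candidates[o] for o in containers) / len(containers)
            )
            scores[phrase] = length_weight * (freq - mean_container_freq)
    return scores


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    example = {
        ("soft", "contact", "lens"): 5,
        ("contact", "lens"): 8,
    }
    for phrase, score in c_value(example).items():
        print(" ".join(phrase), round(score, 3))
```

In a bag-of-key-phrases setting, phrases scoring above some threshold would presumably replace individual words or n-grams as the tokens passed to LDA; the threshold and the candidate-generation step (typically a part-of-speech filter) are left out of this sketch.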


