Discover Underlying Topics in Thai News Articles: A Comparative Study of Probabilistic and Matrix Factorization Approaches

Pimpitcha Pitichotchokphokhin, Piyawat Chuangkrud, Kongkan Kalakan, B. Suntisrivaraporn, Teerapong Leelanupab, Nont Kanungsukkasem
{"title":"Discover Underlying Topics in Thai News Articles: A Comparative Study of Probabilistic and Matrix Factorization Approaches","authors":"Pimpitcha Pitichotchokphokhin, Piyawat Chuangkrud, Kongkan Kalakan, B. Suntisrivaraporn, Teerapong Leelanupab, Nont Kanungsukkasem","doi":"10.1109/ecti-con49241.2020.9158065","DOIUrl":null,"url":null,"abstract":"Topic modeling is an unsupervised learning approach, which can automatically discover the hidden thematic structure in text documents. For text mining, topic modeling is a language-independent technique that disregards grammar and word order. Apart from semantic and structural issues, Thai language is typically considered more complex than others. Due to the lack of word delimiter and a surfeit of composite words. Errors from word tokenization can create significant problems for any post processes of text, such as document retrieval, sentiment analysis, machine translation, etc., adversely decreasing the performance of text applications. Despite a strong correlation between word ordering and semantic meaning, topic modeling has been widely reported that it can extract latent information, aka. latent topic or latent semantic, encoded in documents. Although there were few previous research works on studying topic modeling in Thai language, they mostly focused on upstream processes of Natural Language Processing (NLP) in, for example, applying a refined stop-word list to, or adding N-gram on a single specific topic modeling method. To our knowledge, this paper is the first to explore different topic modeling approaches, i.e., Latent Dirichlet Allocation (LDA) and Nonnegative Metrix Factorization (NMF), in Thai Language to compare their coherence. 
We also employ and compare a set of state-of-the-art evaluation metrics based on Topic Coherence.","PeriodicalId":371552,"journal":{"name":"2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ecti-con49241.2020.9158065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Topic modeling is an unsupervised learning approach that can automatically discover the hidden thematic structure in text documents. For text mining, topic modeling is a language-independent technique that disregards grammar and word order. Apart from semantic and structural issues, the Thai language is typically considered more complex than others due to its lack of word delimiters and its surfeit of compound words. Errors from word tokenization can create significant problems for any downstream text processing, such as document retrieval, sentiment analysis, and machine translation, adversely decreasing the performance of text applications. Despite the strong correlation between word order and semantic meaning, topic modeling has been widely reported to extract latent information, also known as latent topics or latent semantics, encoded in documents. Although there have been a few previous research works on topic modeling in the Thai language, they mostly focused on upstream processes of Natural Language Processing (NLP), for example, applying a refined stop-word list or adding N-grams to a single specific topic modeling method. To our knowledge, this paper is the first to explore different topic modeling approaches, i.e., Latent Dirichlet Allocation (LDA) and Nonnegative Matrix Factorization (NMF), in the Thai language and to compare their coherence. We also employ and compare a set of state-of-the-art evaluation metrics based on topic coherence.
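As a minimal illustration of the matrix-factorization side of the comparison (not the paper's actual implementation), NMF can be sketched with plain NumPy multiplicative updates on a toy pre-tokenized corpus. The documents, vocabulary, topic count, and iteration count below are illustrative assumptions; real Thai text would first require word segmentation, which is exactly the error-prone upstream step the abstract highlights:

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Minimal NMF via Lee-Seung multiplicative updates: V ~= W @ H,
    where columns of W are topics over the vocabulary."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3   # term-topic weights, kept nonnegative
    H = rng.random((k, m)) + 1e-3   # topic-document weights
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy corpus, already tokenized (for Thai, a word segmenter would run first).
docs = [["election", "vote", "party"],
        ["party", "vote", "minister"],
        ["match", "goal", "team"],
        ["team", "goal", "league"]]
vocab = sorted({w for d in docs for w in d})
# Term-document count matrix: rows are vocabulary terms, columns are documents.
V = np.array([[d.count(w) for d in docs] for w in vocab], dtype=float)

W, H = nmf(V, k=2)
for t in range(2):
    top = [vocab[i] for i in np.argsort(W[:, t])[::-1][:3]]
    print("topic", t, top)
```

The probabilistic counterpart, LDA, instead models each document as a mixture over topics and each topic as a distribution over words; in practice both methods are usually run through libraries rather than hand-rolled updates like the sketch above.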