{"title":"Discover Underlying Topics in Thai News Articles: A Comparative Study of Probabilistic and Matrix Factorization Approaches","authors":"Pimpitcha Pitichotchokphokhin, Piyawat Chuangkrud, Kongkan Kalakan, B. Suntisrivaraporn, Teerapong Leelanupab, Nont Kanungsukkasem","doi":"10.1109/ecti-con49241.2020.9158065","DOIUrl":null,"url":null,"abstract":"Topic modeling is an unsupervised learning approach, which can automatically discover the hidden thematic structure in text documents. For text mining, topic modeling is a language-independent technique that disregards grammar and word order. Apart from semantic and structural issues, Thai language is typically considered more complex than others. Due to the lack of word delimiter and a surfeit of composite words. Errors from word tokenization can create significant problems for any post processes of text, such as document retrieval, sentiment analysis, machine translation, etc., adversely decreasing the performance of text applications. Despite a strong correlation between word ordering and semantic meaning, topic modeling has been widely reported that it can extract latent information, aka. latent topic or latent semantic, encoded in documents. Although there were few previous research works on studying topic modeling in Thai language, they mostly focused on upstream processes of Natural Language Processing (NLP) in, for example, applying a refined stop-word list to, or adding N-gram on a single specific topic modeling method. To our knowledge, this paper is the first to explore different topic modeling approaches, i.e., Latent Dirichlet Allocation (LDA) and Nonnegative Metrix Factorization (NMF), in Thai Language to compare their coherence. 
We also employ and compare a set of state-of-the-art evaluation metrics based on Topic Coherence.","PeriodicalId":371552,"journal":{"name":"2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ecti-con49241.2020.9158065","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 0
Abstract
Topic modeling is an unsupervised learning approach that can automatically discover the hidden thematic structure in text documents. For text mining, it is a language-independent technique that disregards grammar and word order. Beyond semantic and structural issues, the Thai language is typically considered more complex than many others because it lacks word delimiters and contains a surfeit of compound words. Errors from word tokenization can create significant problems for any downstream text processing, such as document retrieval, sentiment analysis, and machine translation, adversely decreasing the performance of text applications. Despite the strong correlation between word order and semantic meaning, topic modeling has been widely reported to extract latent information, i.e., latent topics or latent semantics, encoded in documents. Although a few previous works have studied topic modeling in the Thai language, they mostly focused on upstream Natural Language Processing (NLP) steps, for example, applying a refined stop-word list to, or adding N-grams on top of, a single specific topic modeling method. To our knowledge, this paper is the first to explore different topic modeling approaches, i.e., Latent Dirichlet Allocation (LDA) and Nonnegative Matrix Factorization (NMF), on Thai-language text and to compare their coherence. We also employ and compare a set of state-of-the-art evaluation metrics based on Topic Coherence.