Effect of Preprocessing and No of Topics on Automated Topic Classification Performance

Iranian Journal of Botany (Q4, Environmental Science) | Pub Date: 2022-06-16 | DOI: 10.33897/fujeas.v3i1.571

Ijaz Hussain

Citations: 0

Abstract

The emergence of the Internet has led to ever-increasing data generation. A large share of this data is textual and highly unstructured. Almost every field, e.g., business, engineering, medicine, and science, can benefit from textual data once knowledge is extracted from it. Knowledge extraction requires extracting and recording metadata about the unstructured text documents that constitute the textual data. This process is known as topic modeling. The resulting topics can ease searching, statistical characterization, and classification. Some well-known algorithms for topic modeling include Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). Different parameters can affect the performance of topic modeling. One interesting measure is the time required to perform topic modeling. Running time is affected by many factors, and this holds for topic modeling as well; however, measuring the time under some constraints can provide useful insight. In this paper, we alter some preprocessing steps and the number of topics to study their impact on the time taken by the LDA and NMF topic models. In preprocessing, we limit our study to altering only the sampling and feature subset selection, whereas in the second step we change the number of topics. The results show a significant improvement in time.
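The experimental setup described in the abstract (vary the sample size, the feature subset, and the number of topics, then measure fitting time) might be sketched roughly as follows. This is only an illustrative sketch, not the authors' code: the helper names (`sample_documents`, `select_features`, `to_count_matrix`, `time_fit`), the whitespace tokenization, and the choice of document frequency as the feature-selection criterion are all assumptions, since the abstract does not specify them.

```python
import random
import time
from collections import Counter


def sample_documents(docs, fraction, seed=0):
    """Sampling step: return a random subset of the corpus (hypothetical helper)."""
    rng = random.Random(seed)
    k = max(1, int(len(docs) * fraction))
    return rng.sample(docs, k)


def select_features(docs, max_features):
    """Feature subset selection: keep the terms with the highest document
    frequency. The criterion is an assumption; the abstract does not name one."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:max_features]]


def to_count_matrix(docs, vocabulary):
    """Build a dense document-term count matrix restricted to the vocabulary."""
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            j = index.get(token)
            if j is not None:
                row[j] += 1
        rows.append(row)
    return rows


def time_fit(fit, matrix, n_topics):
    """Return the wall-clock seconds taken by fit(matrix, n_topics)."""
    start = time.perf_counter()
    fit(matrix, n_topics)
    return time.perf_counter() - start
```

If scikit-learn is available (an assumption about tooling; the abstract does not name a library), `fit` could be something like `lambda X, k: LatentDirichletAllocation(n_components=k).fit(X)` or `lambda X, k: NMF(n_components=k).fit(X)`, and looping over several values of `fraction`, `max_features`, and `n_topics` would give the kind of timing comparison the abstract describes.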
