Performance Comparison of Topic Modeling Algorithms on Indonesian Short Texts
Authors: N. Hidayati, Anne Parlina
Published in: Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications (2022-11-22)
DOI: https://doi.org/10.1145/3575882.3575905
The number of short texts produced daily has increased significantly, since short texts are a form of social communication commonly used on the internet. Extracting topics from large collections of short texts is one of the most challenging tasks in natural language processing, yet it has numerous real-world applications. This study compares the topic extraction performance of the Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) algorithms on Indonesian short texts. The data were gathered from news articles about electric vehicles published on the online news site Kompas.com. In terms of topic coherence scores, our results show that LDA outperforms NMF and GSDMM. However, human judgment indicates that the word clusters produced by NMF and GSDMM are easier to interpret.
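The abstract does not say which coherence measure was used, so the following is only a minimal sketch of how a topic coherence score can be computed, assuming the UMass measure (log-ratios of document co-occurrence counts over a topic's top words). The function name `umass_coherence`, the toy corpus, and the two word lists are hypothetical illustrations and are not taken from the paper's data or results.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    top_words: topic words ordered from most to least probable.
    documents: tokenized corpus, one list of tokens per document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every one of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps list order, so `earlier` always outranks `later`.
    for earlier, later in combinations(top_words, 2):
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs in the corpus; skip the pair
        score += math.log((doc_freq(earlier, later) + 1) / denom)
    return score


# Hypothetical toy corpus of tokenized Indonesian short texts.
docs = [
    ["mobil", "listrik", "baterai"],
    ["mobil", "listrik", "harga"],
    ["baterai", "nikel", "pabrik"],
    ["harga", "subsidi", "motor"],
]

# Words that co-occur in documents versus words that never do.
coherent_score = umass_coherence(["mobil", "listrik"], docs)
incoherent_score = umass_coherence(["mobil", "nikel"], docs)
print(coherent_score, incoherent_score)
```

Under this measure, a topic whose top words tend to appear in the same documents receives a higher (less negative) score. A high automatic score does not guarantee that a person finds the word cluster easy to interpret, which is the gap between coherence and human judgment that the abstract reports.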