Clustering Analysis of Bangla News Articles with TF-IDF & CV Using Mini-Batch K-Means and K-Means
Authors: S. Hasan, Wang Ruiqin, Md Gulzar Hussain
Published in: 2022 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom)
Publication date: 2022-06-16
DOI: 10.1109/CyberneticsCom55287.2022.9865339 (https://doi.org/10.1109/CyberneticsCom55287.2022.9865339)
Citation count: 0
Abstract
Document clustering is the grouping of documents into classes or clusters according to their textual content. The primary objective is to form groups that are internally coherent but substantially different from each other. It is a vital method in information retrieval, information extraction and the organization of records. Around 210 million people worldwide speak Bangla as a first or second language. Over time, such computer-assisted approaches have also been applied to the Bangla language. However, few papers have described the current state of research in Bangla document clustering. This work tests the K-Means and Mini-Batch K-Means clustering algorithms on Bangla news text data and analyses their performance using the silhouette score and the homogeneity score. The findings show that, with TF-IDF, K-Means and Mini-Batch K-Means give silhouette scores of 0.031 and 0.015 and homogeneity scores of 0.33 and 0.27, respectively, for 11 clusters, which is better than the results obtained with CountVectorizer.
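
The two evaluation metrics named in the abstract can be illustrated with a minimal pure-Python sketch. It follows the standard definitions: homogeneity is 1 - H(C|K)/H(C), where C are the true classes and K the predicted clusters, and the silhouette coefficient of a point is (b - a)/max(a, b), where a is its mean distance to its own cluster and b its mean distance to the nearest other cluster. The function names, the Euclidean distance and the toy points and labels are assumptions made for illustration; they are not taken from the paper, which does not state how the scores were computed.

```python
import math
from collections import Counter


def homogeneity(labels_true, labels_pred):
    """Homogeneity h = 1 - H(C|K) / H(C); defined as 1 when H(C) = 0."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # Entropy of the true classes, H(C).
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0

    # Conditional entropy of the classes given the clusters, H(C|K).
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(               # separation: nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four 2-D points standing in for document vectors.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels_true = ["sports", "sports", "politics", "politics"]
good_clusters = [0, 0, 1, 1]
bad_clusters = [0, 1, 0, 1]

if __name__ == "__main__":
    print("homogeneity (good):", homogeneity(labels_true, good_clusters))
    print("silhouette  (good):", silhouette(points, good_clusters))
    print("silhouette  (bad): ", silhouette(points, bad_clusters))
```

Both scores have an upper bound of 1. A silhouette value near 0, such as those reported in the abstract, indicates that clusters overlap heavily in the chosen vector space, while the homogeneity score measures how far each cluster is dominated by a single true category.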