Optimizing K-means text document clustering using latent semantic indexing and pillar algorithm

2017 5th International Symposium on Computational and Business Intelligence (ISCBI). Pub Date: 2017-08-01. DOI: 10.1109/ISCBI.2017.8053549

Sigit Adinugroho, Y. A. Sari, M. A. Fauzi, P. P. Adikara

Citations: 9

Abstract

Document clustering is an important tool for managing the vast amount of digital text documents. This paper introduces a new approach to clustering text documents. First, the text is preprocessed and indexed using an inverted index. The index is then trimmed using TF-DF thresholding, and a Term Document Matrix is built from TF-IDF weights. Next, Latent Semantic Indexing extracts the important features from the Term Document Matrix. Seeds are then selected via the Pillar algorithm, and K-Means clustering is performed starting from those seeds. Experimental results show that this approach outperforms standard K-Means document clustering.
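The pipeline described in the abstract can be illustrated with a minimal sketch, assuming NumPy is available. This is not the authors' implementation. The function names, the toy corpus, the `min_df` trimming rule (standing in for TF-DF thresholding), the `1 + log(N/df)` weighting variant, and the farthest-point seeding (a simplified stand-in for the Pillar algorithm, whose details the abstract does not give) are all assumptions made for illustration. The matrix here holds one row per document, that is, the transpose of a term-by-document layout.

```python
import math
from collections import Counter

import numpy as np


def build_tfidf(docs, min_df=2):
    """Return (terms, matrix): one row per document, TF-IDF weighted."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed trimming rule: keep terms found in at least min_df documents.
    terms = sorted(t for t, n in df.items() if n >= min_df)
    col = {t: j for j, t in enumerate(terms)}
    matrix = np.zeros((len(docs), len(terms)))
    for i, tokens in enumerate(tokenized):
        for term, tf in Counter(tokens).items():
            if term in col:
                idf = 1.0 + math.log(len(docs) / df[term])
                matrix[i, col[term]] = tf * idf
    return terms, matrix


def lsi_project(matrix, rank=2):
    """Project documents onto the top `rank` singular directions (LSI)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]


def farthest_point_seeds(points, k):
    """Pick k mutually distant seeds (simplified stand-in for Pillar)."""
    # First seed: the point farthest from the grand mean.
    dist = np.linalg.norm(points - points.mean(axis=0), axis=1)
    seeds = [int(dist.argmax())]
    # min_dist[i]: distance from point i to its nearest chosen seed.
    min_dist = np.linalg.norm(points - points[seeds[0]], axis=1)
    while len(seeds) < k:
        nxt = int(min_dist.argmax())
        seeds.append(nxt)
        to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, to_new)
    return points[seeds].copy()


def kmeans(points, centroids, iters=50):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(iters):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids


# Hypothetical toy corpus with two topics.
docs = [
    "cat dog pet animal",
    "dog pet animal food",
    "cat pet animal",
    "stock market trade price",
    "market trade price bank",
    "stock trade price",
]
terms, tfidf = build_tfidf(docs, min_df=2)
latent = lsi_project(tfidf, rank=2)
seeds = farthest_point_seeds(latent, k=2)
labels, _ = kmeans(latent, seeds)
print(labels)
```

On this toy corpus the first three and the last three documents land in separate clusters. Because the seeds are chosen deterministically, K-Means starts from the same well-separated points on every run, unlike random initialization.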
