Document clustering based on time series

2015 19th International Conference on System Theory, Control and Computing (ICSTCC). Pub Date: 2015-10-01. DOI: 10.1109/ICSTCC.2015.7321281

L. Matei, Stefan Trausan-Matu

Citations: 2

Abstract

This paper presents a novel document clustering algorithm that represents each document as a time series of words. Document clustering matters because it lets us group documents according to chosen criteria, which is especially useful now that very large numbers of articles are available. Representing a document as a time series rather than with the vector model allows us to compute the distance between documents with a different algorithm: dynamic time warping. This representation, together with dynamic time warping, forms the foundation for computing document similarity and for clustering the documents. The clustering algorithm used is hierarchical clustering. We apply the method to the named entities and to the parts of speech of the words that compose the documents. As test data we use the Reuters corpus of newspaper articles.
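The pipeline described in the abstract (sequence representation, dynamic time warping distance, hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the local cost, the linkage criterion, or the stopping rule, so the 0/1 mismatch cost, the average linkage, the fixed cluster count `k`, the toy part-of-speech sequences, and the function names `dtw_distance` and `hierarchical_clusters` are all assumptions made for illustration.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Dynamic time warping distance between two symbol sequences.

    Assumption: the local cost is 0 when two symbols match and 1 otherwise.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def hierarchical_clusters(docs, k):
    """Agglomerative clustering that merges until k clusters remain.

    Assumption: average linkage over pairwise DTW distances.
    """
    dist = {
        (i, j): dtw_distance(docs[i], docs[j])
        for i, j in combinations(range(len(docs)), 2)
    }

    def linkage(c1, c2):
        # Mean DTW distance over all cross-cluster document pairs
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # Find the closest pair of clusters and merge them
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical documents, each reduced to its sequence of part-of-speech tags
docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "NOUN", "VERB", "DET", "NOUN"],
    ["VERB", "ADV", "VERB", "ADV"],
    ["VERB", "ADV", "ADV", "VERB", "ADV"],
]
print(hierarchical_clusters(docs, 2))  # prints [[0, 1], [2, 3]]
```

Because warping lets one element align with several consecutive elements of the other sequence, documents 0 and 1 have distance 0 even though their lengths differ, which is the property that distinguishes this distance from a position-by-position comparison.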


