A Comparative Study on Vietnamese Text Classification Methods

2007 IEEE International Conference on Research, Innovation and Vision for the Future
Pub Date: 2007-03-05
DOI: 10.1109/RIVF.2007.369167

Cong Duy Vu Hoang, Dinh Dien, N. Nguyen, H. Ngo

Citations: 25

Abstract

Text classification is the problem of automatically assigning text passages (or documents) to predefined categories (or topics). Whereas a wide range of methods has been applied to English text classification, relatively few studies have addressed Vietnamese text classification. Using a Vietnamese news corpus, we present two approaches to the Vietnamese text classification problem: Bag of Words (BOW) and Statistical N-Gram Language Modeling (N-Gram). We evaluate these two widely used approaches on our task and show that they achieve an average accuracy above 95%, with an average classification time of 79 minutes for about 14,000 documents (roughly 3 documents per second). We also analyze the advantages and disadvantages of each approach to identify the best method for specific circumstances.
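To make the two families of approaches named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace tokenization (which yields Vietnamese syllables rather than segmented words), a cosine-to-centroid classifier for the bag-of-words side, and a character bigram model with add-one smoothing for the n-gram side. The toy corpus, the category names, and every function name are hypothetical and chosen only for illustration.

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    # Assumption: NFC normalization so composed and decomposed diacritics match.
    return unicodedata.normalize("NFC", text).lower()


def bow_vector(text):
    """Bag of words: token counts, ignoring word order."""
    return Counter(normalize(text).split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_bow(labeled_docs):
    """Sum the BOW vectors of each category into one centroid per category."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(bow_vector(text))
    return centroids


def classify_bow(centroids, text):
    vec = bow_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def char_ngrams(text, n=2):
    """Character n-grams, left-padded so the first character has a context."""
    padded = " " * (n - 1) + normalize(text)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train_ngram(labeled_docs, n=2):
    """Per category: counts of each n-gram and of each (n-1)-character context."""
    models = {}
    for text, label in labeled_docs:
        grams, contexts = models.setdefault(label, (Counter(), Counter()))
        for g in char_ngrams(text, n):
            grams[g] += 1
            contexts[g[:-1]] += 1
    return models


def classify_ngram(models, text, n=2):
    """Pick the category whose language model gives the text the highest log-probability."""
    chars = {g[-1] for grams, _ in models.values() for g in grams}
    vocab_size = len(chars) + 1  # one extra slot for unseen characters

    def log_prob(label):
        grams, contexts = models[label]
        return sum(
            math.log((grams[g] + 1) / (contexts[g[:-1]] + vocab_size))  # add-one smoothing
            for g in char_ngrams(text, n)
        )

    return max(models, key=log_prob)


# Hypothetical toy corpus; the paper's news corpus is not reproduced here.
TRAIN = [
    ("bóng đá đội tuyển thắng trận", "sports"),
    ("cầu thủ ghi bàn trận đấu", "sports"),
    ("ngân hàng tăng lãi suất", "business"),
    ("thị trường chứng khoán giảm", "business"),
]

bow_model = train_bow(TRAIN)
ngram_model = train_ngram(TRAIN)
print(classify_bow(bow_model, "đội tuyển bóng đá"))
print(classify_ngram(ngram_model, "lãi suất ngân hàng"))
```

One practical difference the sketch exposes: the bag-of-words side depends on how the text is tokenized, whereas the character n-gram side works directly on the raw string and needs no word segmentation.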
