A Comparative Study on Vietnamese Text Classification Methods

2007 IEEE International Conference on Research, Innovation and Vision for the Future
Pub Date: 2007-03-05
DOI: 10.1109/RIVF.2007.369167

Cong Duy Vu Hoang, Dinh Dien, N. Nguyen, H. Ngo


Citations: 25

Abstract


Text classification is the problem of automatically assigning text passages (or documents) to predefined categories (or topics). Whereas a wide range of methods has been applied to English text classification, relatively few studies have addressed Vietnamese. Using a Vietnamese news corpus, we evaluate two widely used approaches to Vietnamese text classification: Bag of Words (BOW) and statistical N-gram language modeling (N-Gram). These approaches achieve an average accuracy above 95%, with an average classification time of 79 minutes for about 14,000 documents (roughly 3 docs/sec). We also analyze the advantages and disadvantages of each approach to identify the best method for specific circumstances.
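To make the contrast between the two families concrete, here is a minimal sketch, not the authors' implementation. The abstract does not say which classifier, tokenizer, or smoothing the paper used, so everything below is an assumption: a multinomial Naive Bayes model stands in for the BOW approach, an add-one-smoothed bigram model per class (with a uniform class prior) stands in for the N-gram language modeling approach, and the four training sentences are invented toy data. Whitespace tokenization is also a simplification, since it splits Vietnamese text into syllables and not into words.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenization (assumption: yields Vietnamese syllables, not words)."""
    return text.lower().split()


class BowNaiveBayes:
    """Bag of words: token order is discarded, only per-class counts matter."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
            self.doc_counts[label] += 1
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen tokens
        scores = {}
        for label, counts in self.word_counts.items():
            n = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            score += sum(math.log((counts[t] + 1) / (n + v)) for t in tokens)
            scores[label] = score
        return max(scores, key=scores.get)


class BigramLmClassifier:
    """One bigram language model per class; predicts the class whose model
    gives the document the highest log-probability (uniform class prior)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # label -> (prev, word) counts
        self.contexts = defaultdict(Counter)  # label -> prev counts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = ["<s>"] + tokenize(text) + ["</s>"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[label][(prev, word)] += 1
                self.contexts[label][prev] += 1
        return self

    def predict(self, text):
        tokens = ["<s>"] + tokenize(text) + ["</s>"]
        v = len(self.vocab) + 1
        scores = {}
        for label in self.bigrams:
            # Add-one smoothed estimate of P(word | prev, label)
            scores[label] = sum(
                math.log((self.bigrams[label][(p, w)] + 1)
                         / (self.contexts[label][p] + v))
                for p, w in zip(tokens, tokens[1:])
            )
        return max(scores, key=scores.get)


# Invented toy training data; the labels are hypothetical news topics.
texts = [
    "đội bóng thắng trận đấu",
    "cầu thủ ghi bàn thắng",
    "giá cổ phiếu tăng mạnh",
    "ngân hàng giảm lãi suất",
]
labels = ["sports", "sports", "business", "business"]

bow = BowNaiveBayes().fit(texts, labels)
lm = BigramLmClassifier().fit(texts, labels)

print(bow.predict("cầu thủ thắng trận"))
print(lm.predict("ngân hàng tăng lãi suất"))
```

The structural difference is that the BOW model scores each token independently, while the bigram model conditions each token on its predecessor, so it needs no separate word segmentation step to capture some multi-syllable units, at the cost of sparser counts.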
