Presenting an improved combination for classification of Persian texts

2016 Eighth International Conference on Information and Knowledge Technology (IKT), Pub Date: 2016-09-01, DOI: 10.1109/IKT.2016.7777773

M. Jahantigh, M. Erfani, N. Daneshpour, Nargess Orojlou


Citations: 4

Abstract



Because a large amount of information is stored in text form, text mining has very high application potential. One of its main applications is classifying texts by subject. In this paper, we consider different methods of Persian text classification and propose a new method intended to increase classification accuracy and efficiency. We used 5,330 news items from the Hamshahri collection for classification. For stop-word removal during pre-processing, we propose a new method based on the entropy of words. Word frequency and TF-IDF are used for feature extraction. The texts are classified with the k-nearest-neighbor algorithm, Naive Bayes, and a mixture of classifiers, using combinational classification and mixture of experts. By introducing entropy in pre-processing together with the mixture of classifiers, the proposed method achieves a 15 percent improvement over previous work on this collection. In the best case, scientific and cultural news reached a classification accuracy of 96.36 percent.
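The abstract does not say how word entropy is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a word spread almost uniformly across subject categories (high entropy) carries little class information and can be treated as a stop-word candidate. The function names, the toy English tokens, and the `threshold` value are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def word_entropies(docs):
    """Return {word: normalized entropy in [0, 1]} over class labels.

    docs is a list of (tokens, label) pairs. The entropy of a word is
    computed from how its occurrences are distributed across labels
    (an assumption; the abstract does not define it).
    """
    per_word = defaultdict(Counter)  # word -> Counter(label -> occurrences)
    labels = set()
    for tokens, label in docs:
        labels.add(label)
        for token in tokens:
            per_word[token][label] += 1

    # Maximum possible entropy is log2(number of labels); guard the 1-label case.
    max_entropy = math.log2(len(labels)) if len(labels) > 1 else 1.0

    entropies = {}
    for word, by_label in per_word.items():
        total = sum(by_label.values())
        h = -sum((c / total) * math.log2(c / total) for c in by_label.values())
        entropies[word] = h / max_entropy
    return entropies


def entropy_stop_words(docs, threshold=0.95):
    """Words whose normalized entropy reaches the (hypothetical) threshold."""
    return {w for w, h in word_entropies(docs).items() if h >= threshold}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the Hamshahri collection.
    toy_docs = [
        (["the", "team", "won", "the", "match"], "sport"),
        (["the", "bank", "raised", "the", "rate"], "economy"),
        (["the", "film", "won", "the", "award"], "culture"),
    ]
    print(entropy_stop_words(toy_docs))
```

In this toy example only "the" is evenly distributed over all three categories, so it is the only word flagged. In practice the threshold would need tuning on held-out data, and the remaining tokens would then feed the word-frequency or TF-IDF feature extraction the abstract mentions.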
