A support vector machine mixed with TF-IDF algorithm to categorize Bengali document
Authors: Md Saiful Islam, Fazla Elahi Md Jubayer, S. Ahmed
Published in: 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE)
Publication date: 2017-02-01
DOI: 10.1109/ECACE.2017.7912904 (https://doi.org/10.1109/ECACE.2017.7912904)
Citations: 33
Abstract
Document categorization is the task of determining the category to which a document belongs. This paper deals with the automatic classification of Bangla documents. In the proposed categorization system, a support vector machine (SVM) is used to classify a document into one of twelve predefined categories. In this classification model, TF-IDF (term frequency-inverse document frequency) weighting with length normalization is used for feature selection after the data set has been preprocessed. The results achieved by applying the SVM to classify Bangla documents are shown to be very promising compared to conventional methods in which features are chosen on the basis of bag-of-words. The accuracy of the proposed methodology is 92.57% for the twelve categories.
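
The abstract does not state the exact weighting formula, so the following is only a minimal sketch of TF-IDF weighting with length normalization under common assumptions: raw term counts for the term frequency, log(N/df) for the inverse document frequency, and scaling each document vector to unit Euclidean length. The function name `tfidf_vectors` and the toy English tokens are hypothetical stand-ins for preprocessed Bangla documents; they are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one length-normalized TF-IDF dict per tokenized document.

    Assumed scheme (not confirmed by the paper): tf = raw count,
    idf = log(N / df), followed by L2 (Euclidean) normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight each term by its count times its inverse document frequency.
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Length normalization: divide by the Euclidean norm of the vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors

# Hypothetical toy corpus standing in for preprocessed documents.
docs = [
    ["cricket", "match", "team"],
    ["election", "vote", "team"],
    ["cricket", "cricket", "score"],
]
vecs = tfidf_vectors(docs)
```

In this sketch, a term that occurs in fewer documents (such as "match") receives a larger weight than one shared across documents (such as "team"), and normalization keeps long and short documents on a comparable scale. Vectors of this kind would then be passed to an SVM trained over the twelve categories; the classifier itself is not shown here.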