Automatic Multiclass Document Classification of Hindi Poems using Machine Learning Techniques

2020 International Conference for Emerging Technology (INCET). Pub Date: 2020-06-01. DOI: 10.1109/INCET49848.2020.9154001

Kaushika Pal, B. Patel

Citations: 8

Abstract

Text classification of Indic languages faces fundamental challenges in achieving good accuracy, because the languages are morphologically rich and a great deal of information is fused into individual words. This paper presents an experiment that classifies Hindi poem documents into three classes: Shringar, Karuna, and Veera. Poem content conveys mood and has sentiments associated with it, and classifying emotions becomes more challenging when the language is morphologically rich. In the experiment, 122 documents manually collected from the web were preprocessed, yielding 122 documents containing only meaningful data. Features were then extracted from the processed documents using the Bag of Words model and converted into a numeric representation to be passed to the training model. For classification, five machine-learning algorithms were used, each in two versions: Random Forest, Support Vector Machine, Decision Tree, K-Nearest Neighbors, and Naive Bayes. The model is tested on 20% of the data, held out as test data, and its predictions are compared with the stored labels of this data to calculate accuracy. The experiments show that Naive Bayes, with 64% accuracy, and Random Forest, with 56%, perform better than the other algorithms for Hindi poem classification.
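The abstract does not give the preprocessing steps, feature settings, or classifier implementations, so the following is only a minimal sketch of the general pipeline it describes: bag-of-words counts, a 20% held-out test split, and a multinomial Naive Bayes classifier scored by accuracy. It uses only the Python standard library. The names `bag_of_words`, `train_test_split`, `MultinomialNaiveBayes`, and `accuracy` are hypothetical, and the romanized tokens in the toy corpus are illustrative placeholders, not text from the 122-poem dataset.

```python
import math
import random
from collections import Counter, defaultdict


def bag_of_words(text):
    """Represent a document as token -> count (whitespace tokenization, an assumption)."""
    return Counter(text.split())


def train_test_split(docs, labels, test_fraction=0.2, seed=0):
    """Shuffle document indices and hold out `test_fraction` of them for testing."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(docs) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_docs = [docs[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    test_docs = [docs[i] for i in test_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_docs, train_labels, test_docs, test_labels


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # per-class token counts
        self.vocabulary = set()
        for text, label in zip(docs, labels):
            counts = bag_of_words(text)
            self.word_counts[label].update(counts)
            self.vocabulary.update(counts)
        self.total_docs = len(docs)
        return self

    def predict(self, text):
        counts = bag_of_words(text)
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for word, count in counts.items():
                if word not in self.vocabulary:
                    continue  # tokens never seen in training carry no evidence
                likelihood = (self.word_counts[label][word] + 1) / (total_words + vocab_size)
                score += count * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the stored label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(docs)


# Toy corpus: made-up placeholder tokens, NOT taken from the paper's data.
toy_docs = [
    "prem milan chandni", "prem roop milan",
    "dukh aansu viyog", "aansu shok dukh",
    "yuddh shaurya talwar", "shaurya veer yuddh",
]
toy_labels = ["Shringar", "Shringar", "Karuna", "Karuna", "Veera", "Veera"]

model = MultinomialNaiveBayes().fit(toy_docs, toy_labels)
print(model.predict("prem milan"))  # Shringar
```

On a real corpus, one would call `train_test_split` first, fit on the training portion only, and report `accuracy` on the held-out portion. With 122 documents and a 20% split, this sketch would hold out 24 documents and train on 98; whether the paper rounds the same way is not stated.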

Source journal

2020 International Conference for Emerging Technology (INCET)

Self-citation rate

0.00%

Number of publications