Urdu News Classification: An Empirical Study Using Machine Learning Techniques

Aman Farooq, Zainab Noreen, Safiyah Batool, Fouzia Naz
Published in: 2022 Mohammad Ali Jinnah University International Conference on Computing (MAJICC), 2022-10-27
DOI: https://doi.org/10.1109/MAJICC56935.2022.9994152

Abstract

Text is a rich source of information, and the volume of text available on the internet is effectively unlimited. Automatic text classification labels text documents with predefined categories and has applications including sentiment analysis, spam detection, and other NLP tasks. Much work has been done on English text classification, but far less on Urdu. No single algorithm is known to outperform all others. Classifiers also usually perform better when the text is preprocessed, yet no standard stemmer, stop word list, or tokenizer is available for Urdu text. Urdu is morphologically rich, which makes designing preprocessing tools for it a challenge. This research aims to narrow the gap by testing different classification algorithms with different combinations of dimensionality reduction techniques on an Urdu news data set to determine which performs better. It also includes designing a stemmer and a tokenizer and preparing a stop word list. The study concluded that SVM performed best when both preprocessing techniques were combined. The fastText library was also tested for Urdu text classification; it achieved 95% accuracy and an F-score 1% lower than that of SVM. In another approach, topic modeling was performed using LDA and documents were represented by their topic weights. Classification on these topic representations did not perform well, although in that setting Random Forest performed better than Naive Bayes and SVM. Future work includes designing a POS tagger that may improve the performance of the stemmer and testing deep learning methods for Urdu text classification.
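
The preprocess-then-classify pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word list, the romanized sample sentences, the category names, and all function and class names below are hypothetical, and a small multinomial Naive Bayes (one of the classifiers the abstract compares) stands in for the full set of models. The stemmer is omitted because the abstract does not describe its rules.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop word list (romanized placeholders, not the authors' list).
STOP_WORDS = {"ka", "ki", "ke", "hai", "aur", "mein", "ne"}


def tokenize(text):
    """Split on whitespace and punctuation, including the Urdu full stop
    (U+06D4) and the Arabic comma (U+060C)."""
    return [t for t in re.split(r"[\s\u06D4\u060C.,!?]+", text.lower()) if t]


def preprocess(text):
    """Tokenize, then drop stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(preprocess(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in preprocess(doc) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(count / n_docs)  # log prior
            for t in tokens:
                # Smoothed log likelihood of the token given the class.
                score += math.log(
                    (self.word_counts[label][t] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: hypothetical romanized headlines, two made-up categories.
train_docs = [
    "team ne match jeet liya",
    "khiladi ne score kiya aur team jeet gayi",
    "bank ne sharah sood barha di",
    "bazaar mein dollar ki qeemat barh gayi",
]
train_labels = ["sports", "sports", "business", "business"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("team ka score"))
print(model.predict("dollar ki qeemat"))
```

Swapping the classifier for an SVM or Random Forest, or replacing the bag-of-words counts with LDA topic weights, would change only the model step; the tokenizer and stop word filter stay the same, which is why the abstract treats the preprocessing tools as a separate contribution.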