Urdu Sentiment Analysis

IF 0.8 Q4 COMPUTER SCIENCE, THEORY & METHODS Applied Computer Systems Pub Date : 2022-06-01 DOI:10.2478/acss-2022-0004

Iffraah Rehman, Tariq Rahim Soomro

Applied Computer Systems, pp. 30-42. Publication type: Journal Article. https://doi.org/10.2478/acss-2022-0004

Citations: 0

Abstract

The world is moving towards increasingly digitalized data, and the number of active social media users grows with each passing day. Each post and comment can give insight into a topic or issue, a product, a brand, and so on. The process of uncovering the underlying information in the opinion a person holds about an entity is called sentiment analysis. The analysis can be carried out through two main approaches: lexicon-based methods or machine learning algorithms. A significant amount of sentiment analysis work has been done across domains in numerous languages, but minimal research has been conducted on Urdu, the national language of Pakistan. Twitter users who are familiar with Urdu post tweets in two textual formats, either Urdu script (Nastaleeq) or Roman Urdu. This paper therefore attempts sentiment analysis on the Urdu language by extracting tweets (both Nastaleeq and Roman Urdu) from Twitter using the Tweepy API. A machine learning-based approach was adopted for the study, with WEKA as the tool. The best algorithm was identified based on evaluation metrics comprising the number of correctly and incorrectly classified instances, accuracy, precision, and recall. SMO was found to be the most suitable machine learning algorithm for sentiment analysis on Urdu (Nastaleeq) tweets, while Random Forest was identified as the best algorithm for Roman Urdu.
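The evaluation metrics named in the abstract can be illustrated with a minimal sketch. WEKA reports these figures itself, so the plain-Python version below is only an illustration and not the authors' workflow; the function name `evaluate`, the labels "pos" and "neg", and the sample predictions are all assumptions made up for the example.

```python
def evaluate(gold, predicted, positive_label):
    """Compute the metrics listed in the abstract for one target class.

    gold and predicted are equal-length sequences of class labels.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    total = len(pairs)

    # Correctly classified instances: predicted label equals the gold label.
    correct = sum(1 for g, p in pairs if g == p)

    # True positives: the target class was predicted and was also the gold label.
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    predicted_pos = sum(1 for _, p in pairs if p == positive_label)
    actual_pos = sum(1 for g, _ in pairs if g == positive_label)

    return {
        "correct": correct,
        "incorrect": total - correct,
        "accuracy": correct / total if total else 0.0,
        # Precision: share of predicted positives that are truly positive.
        "precision": tp / predicted_pos if predicted_pos else 0.0,
        # Recall: share of actual positives that the classifier found.
        "recall": tp / actual_pos if actual_pos else 0.0,
    }


# Hypothetical labels for illustration only; not data from the paper.
gold = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "pos"]
metrics = evaluate(gold, predicted, "pos")
print(metrics)
```

Precision and recall are computed per class here; when several classes are involved, the function would be called once per label and the results averaged according to whichever scheme the comparison calls for.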




Source journal

Applied Computer Systems (COMPUTER SCIENCE, THEORY & METHODS)

Self-citation rate

10.00%

Number of publications

Review time

30 weeks