Bias aware lexicon-based Sentiment Analysis of Malay dialect on social media data: A study on the Sabah Language
M. Hijazi, Lyndia Libin, R. Alfred, Frans Coenen
2016 2nd International Conference on Science in Information Technology (ICSITech), published 2016-10-01
DOI: 10.1109/ICSITECH.2016.7852662 (https://doi.org/10.1109/ICSITECH.2016.7852662)
Citations: 8
Abstract
Sentiment Analysis (SA) has gained popularity over the years for the benefits it brings to the development of the economy, sociology and politics. SA enables the observation, experimentation and quantification of public emotion toward a particular issue. However, little SA work has addressed the Malay language, especially the Malay dialects used on social media. The research presented in this paper performs SA on one derivative of the Malay language, namely the Sabah Language. The Sabah Language, unlike many other languages, does not have a fixed spelling and, when used in an unstructured form as on social media, poses particular difficulties for SA. This paper takes a lexicon-based approach to SA of the Sabah Language as used on social media. The corpus selected for the investigation comprised Facebook posts and tweets written in the Sabah Language, 443 in total. Each was manually annotated as positive, negative or neutral by three annotators. Because the Sabah Language is a derivative of Malay, it shares most of its words with Malay. For this reason, in the Sentiment Lexicon (SL) construction process, an opinion-bearing Malay SL was retrieved, modified and expanded to build the Sabah SL. Three methods of assigning scores to the opinion-bearing words in the SL were employed during SL construction: (i) Simple PSA, (ii) Simple PSA with Switch Negation (PSA-SN) and (iii) Strength-based PSA. A pre-processing phase comprising a spell checker and a short-form corrector was also implemented to reduce the number of distinct words to be analyzed. In the classification phase, two methods, simple classification and bias-aware classification, were used to classify the posts. Experiments were conducted to show the effects of SL modification and expansion, of pre-processing, and of bias-aware classification on SA performance. The highest accuracy, 85.10%, was achieved using bias-aware classification with the modified and expanded SL, scores assigned using Simple PSA, and pre-processed text.
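As a rough illustration of the pipeline described in the abstract, the following is a minimal sketch of lexicon-based scoring with short-form correction and switch negation. It is not the authors' implementation: the lexicon entries, the short-form table, the negator list and all function names are hypothetical, chosen from common standard Malay words only for the example, and the rule that a negator flips the polarity of the next opinion-bearing word is an assumed reading of "Switch Negation". Bias-aware classification is not sketched, since the abstract does not describe how it works.

```python
# Hypothetical toy lexicon (NOT the paper's Sabah SL): word -> polarity score.
LEXICON = {"bagus": 1, "cantik": 1, "teruk": -1, "buruk": -1}

# Hypothetical short-form table mapping informal spellings to full words.
SHORT_FORMS = {"x": "tidak", "tak": "tidak", "bgs": "bagus"}

# Hypothetical negation words that switch the polarity of what follows.
NEGATORS = {"tidak", "bukan"}


def normalize(text):
    """Lowercase, split on whitespace, and expand known short forms."""
    return [SHORT_FORMS.get(token, token) for token in text.lower().split()]


def score(tokens):
    """Sum lexicon scores; a negator flips the sign of the next opinion word."""
    total = 0
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(token, 0)
        if value != 0:
            total += -value if negate else value
            negate = False  # the switch applies to one opinion word only
    return total


def classify(text):
    """Map the summed score to positive, negative or neutral."""
    s = score(normalize(text))
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for post in ["bgs", "x bagus", "tak teruk", "hari ini"]:
        print(post, "->", classify(post))
```

In this sketch, unseen words contribute a score of zero, so a post with no lexicon hits is labelled neutral. A real system for a dialect without fixed spelling would need a far larger short-form table and a spell checker in front of the lookup, which is the role the abstract assigns to the pre-processing phase.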