Saprativa Bhattacharjee, Anirban Das, U. Bhattacharya, S. K. Parui, S. Roy
Sentiment analysis using cosine similarity measure
2015 IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), published 2015-07-09
DOI: https://doi.org/10.1109/ReTIS.2015.7232847
Citations: 18
Abstract
The opinion of other people is often a major factor influencing our decisions. For a consumer it affects purchase decisions, and for a producer or service provider it helps in making business decisions. Companies spend a lot of money and time on surveys to gather public opinion on products and services. Nowadays the web has become a hotspot for finding user opinions on almost anything under the sun. Both money and time can be saved by mining opinions from the web. Moreover, no survey can have a sample size that matches that of the web. Each opinion generally expresses positive, negative, or neutral sentiment. The task of identifying these sentiments is called Sentiment Analysis. This work deals with the analysis of user sentiments in the Telecom domain. Since no standard database of users' opinions in this domain could be found, we developed one by mining the WWW. A major issue with these sample comments is that they are usually extremely noisy, containing numerous spelling and grammatical errors, acronyms, abbreviations, shortened or slang words, etc. Such data cannot be used directly for analyzing sentiments. Hence, a lexicon-based preprocessing algorithm is proposed for noise reduction. A novel idea based on the Cosine Similarity measure is proposed for classifying the sentiment expressed by a user's comment on a five-point scale from -2 (highly negative) to +2 (highly positive). The performance of the proposed strategy is compared with some well-known machine learning algorithms, namely Naive Bayes, Maximum Entropy, and SVM. The proposed Cosine Similarity-based classifier gives 82.09% accuracy for the 2-class problem of identifying positive and negative sentiments. It outperforms all other classifiers by a considerable margin in the 5-class sentiment classification problem, with an accuracy of 71.5%. The same strategy is also used for categorizing each user comment into six different Telecom-specific categories.
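The abstract does not spell out how the cosine similarity measure is applied, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes bag-of-words term-frequency vectors, one aggregated reference vector per sentiment class, and a tiny made-up normalization lexicon standing in for the lexicon-based preprocessing step. The names `NORMALIZATION_LEXICON`, `TRAINING`, `build_class_vectors`, and `classify`, as well as the sample comments, are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical slang/abbreviation lexicon (assumption: the paper's
# actual preprocessing lexicon is not given in the abstract).
NORMALIZATION_LEXICON = {"gr8": "great", "gud": "good", "nw": "network"}

# Hypothetical labelled comments on the five-point scale from -2 to +2.
TRAINING = [
    ("network is very good and fast", 2),
    ("service is good", 1),
    ("it is okay", 0),
    ("network is slow", -1),
    ("very bad network and worst service", -2),
]


def tokenize(text):
    """Lowercase, split on whitespace, and map noisy tokens via the lexicon."""
    return [NORMALIZATION_LEXICON.get(tok, tok) for tok in text.lower().split()]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def build_class_vectors(examples):
    """Sum the term counts of all comments that share a label."""
    vectors = {}
    for text, label in examples:
        vectors.setdefault(label, Counter()).update(tokenize(text))
    return vectors


def classify(text, class_vectors):
    """Return the label whose reference vector is most similar to the comment."""
    query = Counter(tokenize(text))
    return max(
        class_vectors,
        key=lambda label: cosine_similarity(query, class_vectors[label]),
    )


class_vectors = build_class_vectors(TRAINING)
print(classify("gr8 gud network", class_vectors))  # prints 2
print(classify("worst service", class_vectors))    # prints -2
```

With these toy examples, the noisy comment "gr8 gud network" is first normalized to "great good network" and then matched to the +2 class, because that class vector shares the most terms with it relative to vector length. The same scoring could in principle be reused with category reference vectors instead of sentiment ones, which is how the abstract describes the Telecom category task at a high level.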