Joshua Lois Cruz Paulino, Lexter Carl Antoja Almirol, Jun Marco Cruz Favila, Kent Alvin Gerald Loria Aquino, Angelica Hernandez De La Cruz, R. Roxas
2021 5th International Conference on E-Society, E-Education and E-Technology. Published 2021-08-21. DOI: 10.1145/3485768.3485775 (https://doi.org/10.1145/3485768.3485775)
Multilingual Sentiment Analysis on Short Text Document Using Semi-Supervised Machine Learning
Sentiment analysis is the task of identifying the sentiment expressed in text, and it is often applied to social media posts, customer feedback, and product reviews. Various studies have explored how sentiment analysis can be automated using machine learning techniques. However, there have been few attempts to implement sentiment analysis on multilingual text. Furthermore, most existing works use labelled data to train and develop machine learning models for sentiment analysis, and labelling data is often expensive and time-consuming. In this study, a sentiment analysis model for multilingual text using semi-supervised machine learning was explored. The data consist of 50,788 tweets about COVID-19, which were cleaned by removing unnecessary characters, stop words, and emojis. After cleaning, the language of each tweet was identified, and all tweets not written in Filipino or English were removed from the dataset. Afterwards, the tweets were all translated into English in preparation for the annotation phase. This study used an open-source tool, TextBlob, to annotate the tweets. TextBlob outputs the polarity of the text in vector representation. The TextBlob annotations were then validated by human experts through an inter-rater agreement analysis. Agreement between the human annotations and the TextBlob annotations was substantial, with a Fleiss' kappa of 0.78. Classifier models were developed using various machine learning algorithms. Based on the results of the experiment, SVC with count-vectorizer features was the best-performing model, with an accuracy, precision, recall, and F1-score of 95%. For future work, fine-tuning hyperparameters to optimize the models can be considered.
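
The validation step above is summarized by a single Fleiss' kappa value. As a rough illustration of how such a statistic is computed, the following minimal sketch implements the standard Fleiss' kappa formula in plain Python. The function name `fleiss_kappa`, the label strings, and the toy ratings are all hypothetical and are not taken from the study. Treating the automatic annotator as one more rater alongside the human annotators is likewise an assumption about how the comparison might be set up.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list of category labels,
    one label per rater. Every item must have the same number of raters.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of rater pairs that agree.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement by chance, from overall category proportions.
    proportions = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented): 4 tweets, each labelled by 3 raters.
toy_ratings = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "pos"],
]
print(round(fleiss_kappa(toy_ratings), 3))  # 0.333
```

On the commonly cited Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which is the band the reported 0.78 falls in.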