Multilingual Sentiment Analysis on Short Text Document Using Semi-Supervised Machine Learning

2021 5th International Conference on E-Society, E-Education and E-Technology. Pub Date: 2021-08-21. DOI: 10.1145/3485768.3485775

Joshua Lois Cruz Paulino, Lexter Carl Antoja Almirol, Jun Marco Cruz Favila, Kent Alvin Gerald Loria Aquino, Angelica Hernandez De La Cruz, R. Roxas

Citations: 3

Abstract

Sentiment analysis is the task of identifying the sentiment expressed in text, and it is often applied to social media posts, customer feedback, and product reviews. Various studies have explored how sentiment analysis can be done automatically using machine learning techniques. However, there have been few attempts to implement sentiment analysis on multilingual text. Furthermore, most existing work uses labelled data to train and develop machine learning models for sentiment analysis, and obtaining labelled data is often expensive and time-consuming. In this study, a sentiment analysis model for multilingual text using semi-supervised machine learning was explored. The data consist of 50,788 tweets about COVID-19, which were cleaned by removing unnecessary characters, stop words, and emojis. After cleaning, the language of each tweet was identified, and all tweets not written in Filipino or English were removed from the dataset. The remaining tweets were then all translated into English in preparation for the annotation phase. This study used an open-source tool, TextBlob, to annotate the tweets. TextBlob outputs the polarity of the text in vector representation. The TextBlob annotations were then validated by human experts through an inter-rater agreement test. Agreement between the human annotations and the TextBlob annotations was substantial, with a Fleiss’ kappa of 0.78. Classifier models were developed using various machine learning algorithms. Based on the results of the experiment, SVC with count-vectorizer features is the best-performing model, with accuracy, precision, recall, and F1-score all at 95%. For future work, fine-tuning hyperparameters to optimize the models can be considered.
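
The inter-rater agreement statistic mentioned in the abstract, Fleiss’ kappa, can be illustrated with a minimal sketch. The abstract does not say how many raters took part, which labels were used, or whether TextBlob was counted as one of the raters, so the function name `fleiss_kappa`, the label strings, and the sample ratings below are all hypothetical and serve only to show how the statistic is computed.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    ratings: list of items; each item is a list of category labels,
    one label per rater. Every item must have the same number of ratings.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # Per-item tally of how many raters chose each category.
    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # P_i: share of rater pairs that agree on item i.
    p_items = [
        (sum(c * c for c in tally.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for tally in counts
    ]
    p_bar = sum(p_items) / n_items  # mean observed agreement

    # p_j: overall share of all ratings that fall in category j.
    p_cat = [
        sum(tally[cat] for tally in counts) / (n_items * n_raters)
        for cat in categories
    ]
    p_e = sum(p * p for p in p_cat)  # agreement expected by chance

    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical ratings: three raters labelling four tweets.
    sample = [
        ["positive", "positive", "positive"],
        ["negative", "negative", "neutral"],
        ["neutral", "neutral", "neutral"],
        ["negative", "negative", "negative"],
    ]
    print(f"Fleiss' kappa: {fleiss_kappa(sample):.3f}")
```

Under the commonly used Landis and Koch interpretation, values between 0.61 and 0.80 are read as substantial agreement, which matches how the abstract describes the reported 0.78.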
