Multilingual Sentiment Analysis on Short Text Document Using Semi-Supervised Machine Learning

Joshua Lois Cruz Paulino, Lexter Carl Antoja Almirol, Jun Marco Cruz Favila, Kent Alvin Gerald Loria Aquino, Angelica Hernandez De La Cruz, R. Roxas
DOI: 10.1145/3485768.3485775 (https://doi.org/10.1145/3485768.3485775)
Published in: 2021 5th International Conference on E-Society, E-Education and E-Technology
Publication date: 2021-08-21
Citations: 3

Abstract

Sentiment analysis is the task of identifying the sentiment expressed in text, and it is often applied to social media posts, customer feedback, and product reviews. Various studies have explored how sentiment analysis can be automated using machine learning techniques. However, there have been few attempts to implement sentiment analysis on multilingual text. Furthermore, most existing work uses labelled data to train and develop machine learning models for sentiment analysis, and obtaining labelled data is often expensive and time-consuming. In this study, a sentiment analysis model for multilingual text using semi-supervised machine learning was explored. The data consists of 50,788 tweets about COVID-19, which were cleaned by removing unnecessary characters, stop words, and emojis. After cleaning, the language of each tweet was identified, and all tweets not written in Filipino or English were removed from the dataset. The remaining tweets were then translated into English in preparation for the annotation phase. This study used an open-source tool, TextBlob, to annotate the tweets. TextBlob outputs the polarity of the text in vector representation. The TextBlob annotations were then validated by human experts through inter-rater agreement. Agreement between the human annotations and the TextBlob annotations was substantial, with a Fleiss' kappa of 0.78. Classifier models were developed using various machine learning algorithms. Based on the experimental results, SVC with count-vectorizer features is the best-performing model, with accuracy, precision, recall, and F1-score of 95%. For future work, fine-tuning hyperparameters to optimize the models can be considered.
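
The agreement check mentioned in the abstract can be illustrated with a small calculation. The following is a minimal sketch of Fleiss' kappa in plain Python, not the authors' code: the abstract does not state how many raters took part, which label set was used, or how the statistic was computed, so the function name `fleiss_kappa`, the three-label layout (negative, neutral, positive), and the toy counts are all assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total_ratings = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_category = [
        sum(row[j] for row in counts) / total_ratings
        for j in range(n_categories)
    ]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items

    # Agreement expected by chance from the category shares alone.
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy data (not from the study): four tweets, three raters each,
# columns are (negative, neutral, positive).
toy_counts = [
    [3, 0, 0],  # all raters say negative
    [0, 3, 0],  # all raters say neutral
    [0, 0, 3],  # all raters say positive
    [1, 1, 1],  # raters fully disagree
]
print(round(fleiss_kappa(toy_counts), 3))  # 0.625
```

On the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as substantial agreement, which is consistent with how the abstract describes its 0.78.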