Automatic Labelling of Malay Cyberbullying Twitter Corpus using Combinations of Sentiment, Emotion and Toxicity Polarities

R. Maskat, Muhammad Faizzuddin Zainal, Nurrissammimayantie Ismail, N. Ardi, Amirah Ahmad, N. Daud
Published in: Proceedings of the 2020 3rd International Conference on Algorithms, Computing and Artificial Intelligence
Publication date: 2020-12-24
DOI: 10.1145/3446132.3446412 (https://doi.org/10.1145/3446132.3446412)
Citations: 6

Abstract

Automatic labelling is essential for large corpora. Engaging human experts to label data can be challenging: semantic understanding can differ from one labeler to another depending on each individual's language ability. Platforms such as Amazon Mechanical Turk cannot ensure annotation quality in every domain. Extensive steps such as labeler qualification and counter-checking of labels can be implemented, but they increase the cost of data annotation. Thus, the higher the expected quality of the labelled data, the greater the cost. The situation is worse when the language is low-resource, which in this work is Malay. Malay is used mostly in Malaysia, Indonesia, Singapore and Brunei. Unlike English, which has large resources for tapping into the semantics of sentences and has therefore seen automatic labelling mature faster, Malay still has limited resources. The problem is compounded by the use of social media data, where the text is short and unnormalized and code switching is inherently present. Qualified native Malay labelers are also scarce. To overcome this, we devised a method to automatically label a total of 219,444 Malay tweets using a combination of sentiment, emotion and toxicity polarities. We extend the work of Arslan et al., who proposed using sentiment and emotion to identify cyberbullying text. Our work adds toxicity polarity in the context of automatically labelling cyberbullying tweets in Malay. We employed 5 experts with formal degrees in the Malay language to label our training set. We applied this method to a Malay cyberbullying corpus to determine “bully” and “not bully” labels. We tested our method on 54,867 manually labelled instances and achieved high accuracy.
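The abstract does not state how the three polarities are combined, so the following is only a minimal sketch of the general idea, not the authors' rule. It assumes each tweet already carries a sentiment, emotion and toxicity polarity from upstream classifiers; the label vocabularies, the `HOSTILE_EMOTIONS` set, the all-signals-agree rule in `label_tweet`, and the `accuracy` helper are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Polarities:
    """Polarity outputs for one tweet; the label vocabularies are assumed."""
    sentiment: str  # assumed values: "negative", "neutral", "positive"
    emotion: str    # assumed values such as "anger", "joy", "sadness"
    toxicity: str   # assumed values: "toxic", "non-toxic"


# Hypothetical set of emotions treated as hostile; not taken from the paper.
HOSTILE_EMOTIONS = frozenset({"anger", "disgust"})


def label_tweet(p: Polarities) -> str:
    """Return "bully" only when all three signals agree (illustrative rule)."""
    is_bully = (
        p.sentiment == "negative"
        and p.emotion in HOSTILE_EMOTIONS
        and p.toxicity == "toxic"
    )
    return "bully" if is_bully else "not bully"


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of automatic labels that match the manual labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


if __name__ == "__main__":
    # Toy polarity triples, invented purely to exercise the functions.
    tweets = [
        Polarities("negative", "anger", "toxic"),
        Polarities("negative", "sadness", "non-toxic"),
        Polarities("positive", "joy", "non-toxic"),
    ]
    manual_labels = ["bully", "not bully", "not bully"]
    automatic_labels = [label_tweet(t) for t in tweets]
    print(automatic_labels, accuracy(automatic_labels, manual_labels))
```

Requiring agreement across all three signals favours precision over recall; a looser combination (for example, any two signals agreeing) would be an equally plausible reading of "combination of polarities" and would need to be checked against the manually labelled set.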