Hate Speech Detection in Bahasa Indonesia: Challenges and Opportunities

Journal: International Journal of Advanced Computer Science and Applications
Impact factor: 0.9 (Q3, COMPUTER SCIENCE, THEORY & METHODS)
Pub Date: 2023-01-01
DOI: 10.14569/ijacsa.2023.01406125

Endang Wahyu Pamungkas, Divi Galih Prasetyo Putri, A. Fatmawati


Citations: 0


Hate Speech Detection in Bahasa Indonesia: Challenges and Opportunities

This study provides an overview of current research on detecting abusive language in Indonesian social media, examining the existing datasets, methods, challenges, and opportunities in this field. Most existing datasets for detecting abusive language were collected from social media platforms such as Twitter, Facebook, and Instagram, with Twitter being the most commonly used source. Hate speech is the most researched type of abusive language. Various models, including traditional machine learning and deep learning approaches, have been implemented for this task, with deep learning models showing more competitive results. However, transformer-based models are less commonly used in Indonesian hate speech studies. The study emphasizes the importance of exploring more diverse phenomena, such as Islamophobia and political hate speech, and suggests crowdsourcing as a potential approach to annotating datasets. It also encourages researchers to consider code-mixing in Indonesian abusive language datasets, as doing so could improve overall model performance on Indonesian data. The study further suggests that the lack of effective regulations, the anonymity afforded to users on most social networking sites, and the increasing number of Twitter users in Indonesia have contributed to the rising prevalence of hate speech in Indonesian social media. Finally, it notes the importance of considering code-mixed language, out-of-vocabulary words, grammatical errors, and limited context when working with social media data.

Keywords: abusive language; hate speech detection; machine learning; social media


Source journal

International Journal of Advanced Computer Science and Applications (COMPUTER SCIENCE, THEORY & METHODS)

CiteScore: 2.30
Self-citation rate: 22.20%
Articles published: 519

About the journal: IJACSA is a scholarly computer science journal representing the best in research. Its mission is to provide an outlet for quality research to be publicised and published to a global audience. The journal aims to publish papers selected through rigorous double-blind peer review to ensure originality, timeliness, relevance, and readability. In sync with the Journal's vision "to be a respected publication that publishes peer reviewed research articles, as well as review and survey papers contributed by International community of Authors", we have drawn reviewers and editors from institutions and universities across the globe. A double-blind peer review process is conducted to ensure that we retain high standards. At IJACSA, we stand strong because we know that global challenges make way for new innovations, new ways and new talent. International Journal of Advanced Computer Science and Applications publishes carefully refereed research, review and survey papers which offer a significant contribution to the computer science literature, and which are of interest to a wide audience. Coverage extends to all mainstream branches of computer science and related applications.