Telugu language hate speech detection using deep learning transformer models: Corpus generation and evaluation

Namit Khanduja, Nishant Kumar, Arun Chauhan
Systems and Soft Computing, Vol. 6, Article 200112. DOI: 10.1016/j.sasc.2024.200112. Published 2024-06-19. Available at: https://www.sciencedirect.com/science/article/pii/S2772941924000413

Abstract


In today's digital era, social media has become a new tool for communication and information sharing; with high-speed internet widely available, content reaches the masses much faster. A lack of regulation and ethical oversight has fueled the proliferation of abusive language, and hate speech has become a growing concern on social media platforms in the form of posts, replies, and comments targeting individuals, groups, religions, and communities. However, manually classifying hate speech on online platforms is cumbersome and impractical given the sheer volume of data being generated. It is therefore crucial to filter online content automatically to identify and remove hate speech from social media. Widely spoken, resource-rich languages such as English have driven this research and achieved the desired results owing to the availability of large corpora, annotated datasets, and tools. Resource-constrained languages cannot reap the same benefits because they lack such corpora and annotated datasets. India has a diverse set of languages that vary with demographics, many with limited data availability and semantic differences. Telugu is one such low-resource Dravidian language, spoken in southern India.

In this paper, we present a monolingual Telugu corpus of tweets posted on Twitter, annotated with hate and non-hate labels, along with experiments comparing state-of-the-art fine-tuned deep learning models (mBERT, DistilBERT, IndicBERT, NLLB, MuRIL, RNN+LSTM, XLM-RoBERTa, and IndicBART). Through transfer learning and hyperparameter tuning, the models are compared for their effectiveness in classifying hate speech in Telugu text. The fine-tuned mBERT model outperformed all other fine-tuned models, achieving an accuracy of 98.2%. The authors also propose a deployment model for social media accounts.
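The abstract does not include the authors' code. As a minimal sketch of how fine-tuned binary hate/non-hate classifiers are typically compared on a held-out test set (the metric definitions are standard; the gold labels, predictions, and model names below are purely illustrative, not the paper's data):

```python
# Minimal sketch (not the authors' code): comparing binary hate-speech
# classifiers by accuracy and positive-class F1, as in the paper's
# model comparison. Labels: 1 = hate, 0 = non-hate.

def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred):
    """F1 for the positive (hate) class: harmonic mean of precision and recall."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and per-model predictions on a test split.
gold = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "model_b": [1, 0, 1, 0, 0, 0, 1, 1, 1, 0],
}

for name, pred in predictions.items():
    print(f"{name}: accuracy={accuracy(gold, pred):.3f}, f1={f1(gold, pred):.3f}")
```

In practice such comparisons are usually done with `sklearn.metrics.accuracy_score` and `f1_score`; the hand-rolled versions here just make the arithmetic explicit.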
