Artificial Intelligence inspired method for cross-lingual cyberhate detection from low resource languages
Authors: Manpreet Kaur, Munish Saini
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing (impact factor 1.8; JCR Q3, Computer Science, Artificial Intelligence)
Published: 2024-07-11
DOI: https://doi.org/10.1145/3677176
Citations: 0
Abstract
Inflammatory language posted on social media by college and university students is prevalent, prompting platforms to adopt community safety mechanisms. The escalation of hate speech calls for sophisticated algorithms based on artificial intelligence, machine learning, and deep learning to detect offensive internet content. With a few noteworthy exceptions, most studies on automatic hate speech recognition have focused on high-resource languages, mainly English. We address this gap by studying hate speech detection in Punjabi (Gurmukhi), a low-resource Indo-Aryan language spoken in Indian educational institutions. This research identifies cross-lingual hate speech in the code-switched English-Punjabi text used on social media. It proposes an approach that combines the best hate speech detection techniques to cover the gaps and limitations of existing state-of-the-art systems. In this method, Roman Punjabi is first transliterated, and then Bidirectional Encoder Representations from Transformers (BERT) based models are employed for hate detection. The proposed model achieves 0.86 precision and 0.83 recall, and higher educational institutions could employ it to discover the issues and domains where hate is most prevalent.
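For readers less familiar with the two reported metrics, the following is a minimal sketch of how precision and recall are computed for a binary hate/non-hate classifier, and how they combine into a harmonic mean (F1). The toy labels and the function names `precision_recall` and `f1_score` are illustrative assumptions, not data or code from the paper, and the F1 value derived from 0.86 and 0.83 is plain arithmetic that need not match any F1 score the authors report.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical toy labels (1 = hate, 0 = not hate); not from the paper.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
p, r = precision_recall(y_true, y_pred)

# Harmonic mean of the precision and recall stated in the abstract.
f1_from_abstract = f1_score(0.86, 0.83)
print(p, r, round(f1_from_abstract, 3))
```

Precision answers "of the posts flagged as hate, how many were hate?", while recall answers "of the hateful posts, how many were flagged?"; the abstract's 0.86 and 0.83 indicate the model is slightly more conservative than exhaustive.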
About the journal:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania, and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
Papers that deal with theory, systems design, evaluation, and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.