Ikram El Miqdadi, Soufiane Hourri, Fatima Zahra El Idrysy, Assia Hayati, Yassine Namir, Nikola S. Nikolov, Jamal Kharroubi
{"title":"加强种族主义分类:使用自我训练和 CNN 的自动多语言数据注释系统","authors":"Ikram El Miqdadi, Soufiane Hourri, Fatima Zahra El Idrysy, Assia Hayati, Yassine Namir, Nikola S. Nikolov, Jamal Kharroubi","doi":"10.1007/s10618-024-01059-2","DOIUrl":null,"url":null,"abstract":"<p>Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.\n</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"31 1","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN\",\"authors\":\"Ikram El Miqdadi, Soufiane Hourri, Fatima Zahra El Idrysy, Assia Hayati, Yassine Namir, Nikola S. Nikolov, Jamal Kharroubi\",\"doi\":\"10.1007/s10618-024-01059-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.\\n</p>\",\"PeriodicalId\":55183,\"journal\":{\"name\":\"Data Mining and Knowledge Discovery\",\"volume\":\"31 1\",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2024-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data Mining and Knowledge Discovery\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s10618-024-01059-2\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data Mining and Knowledge Discovery","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10618-024-01059-2","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN
Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task proves to be highly demanding in both time and resources, resulting in a significantly costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. This necessitates having information resources in native languages to effectively detect racism, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, utilizing self-training and a combination of the Sentence-BERT (SBERT) transformers-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter. This is achieved through a self-training process that utilizes a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. Especially for the English dataset, the system achieves a noteworthy accuracy rate of 92.53% and an F-score of 88.26%. The French dataset reaches an accuracy of 93.64% and an F-score of 92.68%. Similarly, for the Arabic dataset, the accuracy reaches 91.03%, accompanied by an F-score value of 92.15%. The implementation of self-training results in a remarkable 8–12% improvement in accuracy and F-score, as demonstrated in this study.
期刊介绍:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.