Language agnostic model: detecting islamophobic content on social media
Heena Khan, Joshua L. Phillips
Proceedings of the 2021 ACM Southeast Conference, published 2021-04-15
DOI: https://doi.org/10.1145/3409334.3452077
Citation count: 4
Abstract
Social media platforms can struggle to enforce rules against online abuse and hate speech because of the large volume of content that must be manually reviewed. Machine learning approaches have been proposed in the literature as a way to automate much of this labor, but social content in multiple languages further complicates the problem. Past work has focused on first building word embeddings in the target language, which limits the application of such embeddings to other languages. We use the Google Neural Machine Translator (NMT) to identify and translate non-English text into English, making the system language agnostic. We can therefore use already available pre-trained word embeddings instead of training our models and word embeddings in each language. We experimented with different word-embedding and classifier pairs to assess whether translated English data yields accuracy comparable to that of an untranslated English dataset. On the translated data, our best-performing model, SVM with TF-IDF, achieved a 10-fold cross-validation accuracy of 95.56 percent, followed by the BERT model at 94.66 percent. This accuracy is close to that of the untranslated English dataset and far better than that of the untranslated Hindi dataset.
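The evaluation described in the abstract (TF-IDF features, a classifier, and 10-fold accuracy) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: it assumes the text has already been translated to English, it swaps in a simple nearest-centroid classifier where the paper uses an SVM, and the function names (`fit_idf`, `transform`, `k_fold_accuracy`, `toy_corpus`) and the placeholder corpus are hypothetical. The IDF weights are fit on the training folds only, so the held-out fold does not leak into the features.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc.lower().split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def transform(doc, idf):
    """Map a document to a sparse TF-IDF dict; unseen terms are dropped."""
    counts = Counter(doc.lower().split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def class_centroids(vectors, labels):
    """Sum the vectors of each class (cosine similarity ignores scale)."""
    sums = {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, {})
        for t, w in vec.items():
            acc[t] = acc.get(t, 0.0) + w
    return sums


def k_fold_accuracy(docs, labels, k=10):
    """Mean accuracy over k folds; fold membership is index modulo k."""
    n = len(docs)
    if n < k:
        raise ValueError("need at least k documents")
    scores = []
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        idf = fit_idf([docs[i] for i in train])
        cents = class_centroids(
            [transform(docs[i], idf) for i in train],
            [labels[i] for i in train],
        )
        correct = 0
        for i in test:
            vec = transform(docs[i], idf)
            pred = max(cents, key=lambda lab: cosine(vec, cents[lab]))
            correct += pred == labels[i]
        scores.append(correct / len(test))
    return sum(scores) / k


def toy_corpus():
    """Hypothetical placeholder data standing in for translated posts."""
    docs, labels = [], []
    for i in range(10):
        docs.append(f"alpha beta sample{i}")
        labels.append("flagged")
        docs.append(f"gamma delta sample{i}")
        labels.append("ok")
    return docs, labels


if __name__ == "__main__":
    docs, labels = toy_corpus()
    print(f"10-fold accuracy: {k_fold_accuracy(docs, labels, k=10):.4f}")
```

On real data, the folds would normally be shuffled and stratified by label, and the nearest-centroid step would be replaced by the classifier under study. The figures reported in the abstract come from the authors' own models and datasets, not from this sketch.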