Thais G. Almeida, Bruno A. Souza, F. Nakamura, E. Nakamura
{"title":"在短评论中检测仇恨、攻击性和常规言论","authors":"Thais G. Almeida, Bruno A. Souza, F. Nakamura, E. Nakamura","doi":"10.1145/3126858.3131576","DOIUrl":null,"url":null,"abstract":"The freedom of expression provided by the Internet also favors malicious groups that propagate contents of hate, recruit new members, and threaten users. In this context, we propose a new approach for hate speech identification based on Information Theory quantifiers (entropy and divergence) to represent documents. As a differential of our approach, we capture weighted information of words, rather than just their frequency in documents. The results show that our approach overperforms techniques that use data representation, such as TF-IDF and unigrams combined to text classifiers, achieving an F1-score of 86%, 84% e 96% for classifying hate, offensive, and regular speech classes, respectively. Compared to the baselines, our proposal is a win-win solution that improves efficacy (F1-score) and efficiency (by reducing the dimension of the feature vector). The proposed solution is up to 2.27 times faster than the baseline.","PeriodicalId":338362,"journal":{"name":"Proceedings of the 23rd Brazillian Symposium on Multimedia and the Web","volume":"274 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Detecting Hate, Offensive, and Regular Speech in Short Comments\",\"authors\":\"Thais G. Almeida, Bruno A. Souza, F. Nakamura, E. Nakamura\",\"doi\":\"10.1145/3126858.3131576\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The freedom of expression provided by the Internet also favors malicious groups that propagate contents of hate, recruit new members, and threaten users. In this context, we propose a new approach for hate speech identification based on Information Theory quantifiers (entropy and divergence) to represent documents. As a differential of our approach, we capture weighted information of words, rather than just their frequency in documents. The results show that our approach overperforms techniques that use data representation, such as TF-IDF and unigrams combined to text classifiers, achieving an F1-score of 86%, 84% e 96% for classifying hate, offensive, and regular speech classes, respectively. Compared to the baselines, our proposal is a win-win solution that improves efficacy (F1-score) and efficiency (by reducing the dimension of the feature vector). The proposed solution is up to 2.27 times faster than the baseline.\",\"PeriodicalId\":338362,\"journal\":{\"name\":\"Proceedings of the 23rd Brazillian Symposium on Multimedia and the Web\",\"volume\":\"274 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 23rd Brazillian Symposium on Multimedia and the Web\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3126858.3131576\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 23rd Brazillian Symposium on Multimedia and the Web","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3126858.3131576","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Detecting Hate, Offensive, and Regular Speech in Short Comments
The freedom of expression provided by the Internet also favors malicious groups that propagate contents of hate, recruit new members, and threaten users. In this context, we propose a new approach for hate speech identification based on Information Theory quantifiers (entropy and divergence) to represent documents. As a differential of our approach, we capture weighted information of words, rather than just their frequency in documents. The results show that our approach overperforms techniques that use data representation, such as TF-IDF and unigrams combined to text classifiers, achieving an F1-score of 86%, 84% e 96% for classifying hate, offensive, and regular speech classes, respectively. Compared to the baselines, our proposal is a win-win solution that improves efficacy (F1-score) and efficiency (by reducing the dimension of the feature vector). The proposed solution is up to 2.27 times faster than the baseline.