A Data Viz Platform as a Support to Study, Analyze and Understand the Hate Speech Phenomenon
A. Capozzi, V. Patti, G. Ruffo, C. Bosco
Proceedings of the 2nd International Conference on Web Studies, 2018-10-03
DOI: 10.1145/3240431.3240437
In this paper we present a data visualization platform designed to help Natural Language Processing (NLP) scholars study and analyze different corpora collected in order to understand the hate speech phenomenon in social media. The project started with the creation of a corpus of tweets addressed to specific ethnic minority groups that are considered very controversial in the Italian public debate. Each tweet has been manually tagged with a series of attributes in order to capture the different features used to characterize the hate speech phenomenon. This corpus is mainly built for training an automatic classifier and for supporting its testing and validation, before the classifier is adopted to detect hate speech tweets in larger-scale datasets. Unlike many other traditional machine learning tasks, building a classifier that achieves high accuracy is very challenging in such a scenario, because of the intrinsic ambiguity of language, the lack of a proper and explicable context in social media, and the tendency of online users to be sarcastic and ironic. Therefore, in order to properly validate an effective feature selection process, correlations between the selected attributes must be studied and analyzed. This motivated us to build an interactive platform to explore the data in our corpora across the dimensions that have been used to characterize the collected tweets. In our paper, after a brief introduction to the hate speech dataset, we show how the dashboard can fit into the NLP pipeline and how its architecture can be structured. Finally, we present some of the challenges we faced in visualizing data with spatial, temporal and numerical attributes.
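
As a rough illustration of what studying correlations between annotated attributes might look like in practice, the following minimal sketch computes pairwise Pearson correlations over a handful of tagged tweets. It is not the authors' implementation: the attribute names (`hate_speech`, `aggressiveness`, `irony`), the 0/1 encoding, and the toy records are all assumptions made purely for illustration, and the abstract does not state which correlation measure the platform uses.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant attribute has no defined correlation.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def attribute_correlations(rows, attributes):
    """Correlation for every unordered pair of annotation attributes."""
    columns = {a: [row[a] for row in rows] for a in attributes}
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(attributes, 2)
    }


# Hypothetical toy annotations, not taken from the actual corpus.
tweets = [
    {"hate_speech": 1, "aggressiveness": 1, "irony": 0},
    {"hate_speech": 0, "aggressiveness": 0, "irony": 1},
    {"hate_speech": 1, "aggressiveness": 1, "irony": 0},
    {"hate_speech": 0, "aggressiveness": 0, "irony": 0},
]

correlations = attribute_correlations(
    tweets, ["hate_speech", "aggressiveness", "irony"]
)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

A table of this kind is the sort of quantity an interactive dashboard could expose, for example as a heatmap, so that strongly coupled attributes can be spotted before feature selection.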