Muhammed Said Zengin, Rabia Arslan, Mehmet Burak Akgün
2022 30th Signal Processing and Communications Applications Conference (SIU), published 2022-05-15. DOI: 10.1109/SIU55565.2022.9864702 (https://doi.org/10.1109/SIU55565.2022.9864702)
Distributed Sentiment Analysis for Geo-Tagged Twitter Data
The ever-increasing frequency of sharing on social media makes these platforms one of the primary data sources for computational social science studies. Likewise, examining and analyzing large-scale social media datasets is crucial for governments and companies alike. However, as the amount of data grows, deriving insights from it with artificial-intelligence-based models becomes increasingly demanding in terms of processing power. Hardware requirements can increase dramatically when the insights are needed under real-time or near-real-time constraints. In this study, we developed a distributed sentiment analysis model that operates on a large social media dataset. We collected 16 million tweets and grouped them by originating city. The sentiment analysis model was produced by fine-tuning the pre-trained BERT model. Apache Spark, a distributed big data analytics engine, is used to execute the trained model in a distributed fashion. For evaluation, the prediction time on a single compute unit is compared with the distributed prediction time. The sentiment analysis model was executed separately for each of the data groups corresponding to the 81 provinces. The dataset of 16 million tweets, the Turkish sentiment analysis model, the distributed prediction code developed for Apache Spark, and all results of the study are available at https://distributed-sentiment-analysis.github.io/.
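The group-then-predict structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: a keyword rule stands in for the fine-tuned BERT model, a local thread pool stands in for Apache Spark, and the three-way label set, the function names, and the sample tweets are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the fine-tuned BERT classifier: a keyword rule.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}


def predict_sentiment(text):
    """Return an assumed label: 'positive', 'negative', or 'neutral'."""
    words = set(text.lower().split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def group_by_city(tweets):
    """Group (city, text) pairs into a dict mapping city -> list of texts."""
    groups = defaultdict(list)
    for city, text in tweets:
        groups[city].append(text)
    return dict(groups)


def predict_group(item):
    """Classify every tweet of one city and count the labels."""
    city, texts = item
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for text in texts:
        counts[predict_sentiment(text)] += 1
    return city, counts


def predict_serial(groups):
    """Baseline: process the city groups one after another."""
    return dict(predict_group(item) for item in groups.items())


def predict_parallel(groups, workers=4):
    """Process the city groups concurrently (thread pool in place of Spark)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(predict_group, groups.items()))


# Invented sample data, for illustration only.
SAMPLE_TWEETS = [
    ("Ankara", "great weather today"),
    ("Ankara", "traffic is awful"),
    ("Istanbul", "love this city"),
    ("Izmir", "going to the market"),
]
```

Calling `predict_serial(group_by_city(SAMPLE_TWEETS))` and `predict_parallel(group_by_city(SAMPLE_TWEETS))` should yield the same per-city counts. Timing the two calls on a realistic workload would mirror the single-unit versus distributed comparison the abstract mentions, although a thread pool with a trivial rule-based classifier says nothing about the speedups achievable with Spark and BERT.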