Distributed Sentiment Analysis for Geo-Tagged Twitter Data

2022 30th Signal Processing and Communications Applications Conference (SIU). Pub Date: 2022-05-15. DOI: 10.1109/SIU55565.2022.9864702

Muhammed Said Zengin, Rabia Arslan, Mehmet Burak Akgün

Citations: 0

Abstract

The ever-increasing frequency of sharing on social media makes these platforms one of the primary data sources for computational social science studies. Similarly, examining and analyzing large-scale social media datasets is crucial for governments as well as companies. However, as the amount of data increases, deriving insights from it with artificial-intelligence-based models becomes more and more demanding in terms of processing power. In fact, hardware requirements can increase dramatically if the insights are needed under real-time or near-real-time constraints. In this study, we developed a distributed sentiment analysis model that utilizes a large social media dataset. We collected 16 million tweets and grouped them by originating city. The sentiment analysis model was produced by fine-tuning the pre-trained BERT model. Apache Spark, a distributed big data analytics engine, is used to execute the trained model in a distributed fashion. For evaluation purposes, the prediction time on a single compute unit is compared with the distributed prediction time. The sentiment analysis model was executed separately for each of the data groups corresponding to the 81 provinces. The dataset of 16 million tweets used in this study, the Turkish sentiment analysis model, the distributed prediction code developed for Apache Spark, and all the results of the study can be accessed at https://distributed-sentiment-analysis.github.io/.
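The workflow described in the abstract (group tweets by province, run the classifier on each group, and compare single-unit against distributed prediction time) can be illustrated with a small standard-library sketch. This is not the authors' code: the keyword rule `predict_sentiment` is a hypothetical stand-in for the fine-tuned BERT model, a thread pool stands in for Apache Spark executors, and the function names and sample tweets are assumptions made for illustration. Because Python threads do not speed up CPU-bound work, the sketch shows only the structure of the comparison, not a realistic speedup.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-in for the fine-tuned BERT classifier: a keyword rule.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def predict_sentiment(text):
    """Label one tweet as positive, negative, or neutral."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def group_by_province(tweets):
    """Group (province, text) pairs into {province: [text, ...]}."""
    groups = defaultdict(list)
    for province, text in tweets:
        groups[province].append(text)
    return dict(groups)


def predict_group(item):
    """Run the classifier over every tweet of one province."""
    province, texts = item
    return province, [predict_sentiment(t) for t in texts]


def run_sequential(groups):
    """Baseline: predict all groups on a single compute unit."""
    return dict(predict_group(item) for item in groups.items())


def run_distributed(groups, workers=4):
    """Predict each group on a pool of workers (stand-in for Spark)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(predict_group, groups.items()))


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    sample = [
        ("Ankara", "I love this city"),
        ("Izmir", "awful traffic today"),
        ("Ankara", "nothing special"),
    ]
    groups = group_by_province(sample)
    seq, t_seq = timed(run_sequential, groups)
    par, t_par = timed(run_distributed, groups)
    print(seq == par, f"sequential={t_seq:.6f}s distributed={t_par:.6f}s")
```

In a real deployment, the per-group function would load the trained model once per worker and score tweets in batches, and the speedup would come from spreading the groups across cluster nodes.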
