Lessons Learned: A Case Study in Creating a Data Pipeline using Twitter’s API
Authors: Jason Tiezzi, Rice Tyler, Suchetha Sharma
Published in: 2020 Systems and Information Engineering Design Symposium (SIEDS), 2020-04-01
DOI: https://doi.org/10.1109/SIEDS49339.2020.9106584
Citations: 3
Abstract
With over 300 million users, including frequent postings by elites across the political and entertainment fields, Twitter has become a rich field for mining and analyzing data. Despite its prominence within social science research, relatively little attention has been paid to the process of data acquisition. To that end, our research uses a case study to illustrate the process of acquiring and storing tweets. To construct our data pipeline, we first applied for and created Twitter developer accounts and used the Tweepy library in Python to interact with Twitter’s API. We created a program that uses a producer-consumer multithreading model to request tweets from the API, then cleans the data and pushes it to a MySQL database with four tables: one for tweets, one for user information, one for retweets, and one for special entities (e.g., hashtags). With our pipeline operational, we explore how candidate gender affects Twitter discourse in the 2020 Democratic presidential primary. Specifically, we use unsupervised text analysis methods to examine differences in word frequencies, sentiment, and emotional dimensions. We find that gender is central to the discourse surrounding female candidates, but peripheral for male candidates. The discourse surrounding female candidates in our dataset is also more joyful and positive. Finally, with our case study concluded, we offer lessons to future researchers who wish to acquire and utilize Twitter data for social science research.
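The producer-consumer design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sample tweets, the `clean` helper, the table columns, and the function names are all hypothetical, the producer reads from an in-memory list instead of calling Twitter's API, and Python's built-in `sqlite3` stands in for MySQL so that the example runs without a database server.

```python
import queue
import sqlite3
import threading

SENTINEL = object()  # signals the consumer that the producer is finished

# Hypothetical sample data; a real producer would request tweets from the API.
FAKE_TWEETS = [
    {"id": 1, "user": {"id": 10, "name": "alice"},
     "text": "Debate  night #Election2020", "retweet_of": None},
    {"id": 2, "user": {"id": 11, "name": "bob"},
     "text": "RT Debate night #Election2020", "retweet_of": 1},
    {"id": 3, "user": {"id": 10, "name": "alice"},
     "text": "No hashtags here", "retweet_of": None},
]

# Assumed four-table layout: tweets, users, retweets, entities.
SCHEMA = """
CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tweets   (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
CREATE TABLE retweets (tweet_id INTEGER, original_id INTEGER);
CREATE TABLE entities (tweet_id INTEGER, kind TEXT, value TEXT);
"""


def clean(text):
    """Collapse runs of whitespace (a placeholder for real cleaning)."""
    return " ".join(text.split())


def producer(q, tweets):
    """Put each tweet on the queue, then the sentinel."""
    for tweet in tweets:
        q.put(tweet)  # blocks when the bounded queue is full
    q.put(SENTINEL)


def consumer(q, conn):
    """Take tweets off the queue, clean them, and write the four tables."""
    while True:
        tweet = q.get()
        if tweet is SENTINEL:
            break
        text = clean(tweet["text"])
        user = tweet["user"]
        conn.execute("INSERT OR IGNORE INTO users VALUES (?, ?)",
                     (user["id"], user["name"]))
        conn.execute("INSERT INTO tweets VALUES (?, ?, ?)",
                     (tweet["id"], user["id"], text))
        if tweet["retweet_of"] is not None:
            conn.execute("INSERT INTO retweets VALUES (?, ?)",
                         (tweet["id"], tweet["retweet_of"]))
        for word in text.split():
            if word.startswith("#"):
                conn.execute("INSERT INTO entities VALUES (?, ?, ?)",
                             (tweet["id"], "hashtag", word[1:]))
    conn.commit()


def run_pipeline(tweets):
    """Run one producer and one consumer thread; return the open connection."""
    # check_same_thread=False lets the consumer thread use this connection;
    # only one thread touches it at any given time.
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.executescript(SCHEMA)
    q = queue.Queue(maxsize=100)
    threads = [
        threading.Thread(target=producer, args=(q, tweets)),
        threading.Thread(target=consumer, args=(q, conn)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return conn


if __name__ == "__main__":
    db = run_pipeline(FAKE_TWEETS)
    for table in ("tweets", "users", "retweets", "entities"):
        count = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(table, count)
```

The bounded queue decouples the two stages: the producer can wait on network requests while the consumer cleans and writes, and the `maxsize` limit keeps memory use in check if the database falls behind.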