DeepScraper: A complete and efficient tweet scraping method using authenticated multiprocessing
Authors: Jaebeom You, Kisung Lee, Hyuk-Yoon Kwon
Journal: Data & Knowledge Engineering, volume 149, Article 102260
Publication date: 2023-11-30
DOI: 10.1016/j.datak.2023.102260
URL: https://www.sciencedirect.com/science/article/pii/S0169023X23001209
Impact factor: 2.7 (JCR Q3, Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
In this paper, we propose DeepScraper, a scraping method for collecting tweets. DeepScraper quickly and completely scrapes all tweets written by a given group of users or containing given search keywords. To improve its crawling speed, we devise a multiprocessing architecture that authenticates each process by simulating user access behavior on Twitter. This allows us to maximize crawling parallelism even on a single machine. Through extensive experiments, we show that DeepScraper can crawl all the tweets of 99 users, 5,798,052 tweets in total, whereas the Twitter standard API can crawl only 243,650 of them because of its limit on the number of tweets that can be scraped. In other words, DeepScraper collects 23.7 times more tweets for the 99 users than the standard API. We also show the efficiency of DeepScraper. First, we show the effect of authenticated multiprocessing: compared to DeepScraper with a single process, it increases the crawling speed by 2.03 to 10.57 times as the number of running processes increases from 2 to 32. Then, we compare the crawling speed of DeepScraper with that of existing methods. The results show that DeepScraper's speed is comparable even to that of the Twitter standard API and Twitter4J, while DeepScraper can scrape many more tweets than either. Furthermore, DeepScraper is roughly 3.69 times faster than Twitter Scrapy, while both can scrape all tweets for the target users or keywords.
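The figures quoted in the abstract can be related to one another with a little arithmetic. The sketch below is illustrative only and is not part of the paper. It recomputes the tweet-count ratio from the two totals, then derives a per-process efficiency (speedup divided by process count) from the two reported speedups. The function names are hypothetical. Pairing 2.03 with 2 processes and 10.57 with 32 processes is an assumption based on how the sentence is worded.

```python
# Figures quoted in the abstract.
DEEPSCRAPER_TWEETS = 5_798_052   # tweets DeepScraper collected for the 99 users
STANDARD_API_TWEETS = 243_650    # tweets the standard API returned for the same users

# Speedup over single-process DeepScraper, keyed by number of processes.
# Assumption: 2.03 belongs to 2 processes and 10.57 to 32 processes.
SPEEDUP_BY_PROCESSES = {2: 2.03, 32: 10.57}


def coverage_ratio(collected: int, baseline: int) -> float:
    """How many times as many tweets were collected relative to the baseline."""
    return collected / baseline


def parallel_efficiency(speedup: float, processes: int) -> float:
    """Speedup per process; 1.0 would mean perfectly linear scaling."""
    return speedup / processes


if __name__ == "__main__":
    ratio = coverage_ratio(DEEPSCRAPER_TWEETS, STANDARD_API_TWEETS)
    print(f"tweet-count ratio: {ratio:.2f}")
    for processes, speedup in SPEEDUP_BY_PROCESSES.items():
        eff = parallel_efficiency(speedup, processes)
        print(f"{processes:>2} processes: speedup {speedup}, efficiency {eff:.3f}")
```

Under this reading, the speedup is close to linear at 2 processes and well below linear at 32, which is the usual pattern when many processes share a single machine and a single network link.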
About the journal:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.