Web Scraping versus Twitter API: A Comparison for a Credibility Analysis

Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. Pub Date: 2020-11-30. DOI: 10.1145/3428757.3429104

Irvin Dongo, Yudith Cardinale, A. Aguilera, F. Martínez, Yuni Quintero, Sergio Barrios


Citations: 13

Abstract

Twitter is one of the most popular information sources available on the Web, and many studies therefore focus on analyzing the credibility of the information shared on it. Most proposals use either the Twitter API or web scraping to extract the data for such analysis. Both extraction techniques have advantages and disadvantages. In this work, we present a study that evaluates their performance and behavior. The motivation for this research is the need to know how online information can be extracted in order to analyze, in real time, the credibility of content posted on the Web. To do so, we develop a framework that offers both data extraction alternatives and implements a previously proposed credibility model. Our framework is implemented as a Google Chrome extension able to analyze tweets in real time. Results show that both methods produce identical credibility values when a robust normalization process is applied to the text (i.e., the tweet). Moreover, in terms of time performance, web scraping is faster than the Twitter API, and it is more flexible in terms of the data it can obtain; however, web scraping is very sensitive to website changes.
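The abstract does not describe the normalization process itself. As a rough illustration of why such a step matters, the sketch below assumes that the same tweet may arrive with HTML-escaped entities from one extraction path and with stray whitespace from the other. The function name `normalize_tweet`, the sample strings, and the chosen steps are all hypothetical and are not taken from the paper.

```python
import html
import re
import unicodedata


def normalize_tweet(text: str) -> str:
    """Hypothetical normalization; not the procedure used in the paper."""
    text = html.unescape(text)                  # decode entities, e.g. "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms (non-breaking space -> space)
    text = re.sub(r"\s+", " ", text)            # collapse any run of whitespace into one space
    return text.strip()                         # drop leading and trailing whitespace


# Assumed example inputs: the same tweet as it might look from each extraction path.
api_text = "Breaking: markets &amp; weather\nupdate"
scraped_text = "  Breaking: markets & weather\u00a0update "

# After normalization both variants compare equal, so a text-based
# credibility score computed on either one would see the same input.
assert normalize_tweet(api_text) == normalize_tweet(scraped_text)
```

Under these assumptions, any text-based score is computed on the same string regardless of how the tweet was obtained, which is consistent with the reported result that both methods yield identical credibility values once the text is normalized.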

