Web Scraping versus Twitter API: A Comparison for a Credibility Analysis

Irvin Dongo, Yudith Cardinale, A. Aguilera, F. Martínez, Yuni Quintero, Sergio Barrios
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. Published 2020-11-30. DOI: https://doi.org/10.1145/3428757.3429104
Citations: 13

Abstract

Twitter is one of the most popular information sources available on the Web, and many studies therefore analyze the credibility of the information shared on it. Most proposals extract the data for such analysis using either the Twitter API or web scraping, and each technique has advantages and disadvantages. In this work, we present a study that evaluates their performance and behavior. The research is motivated by the need to understand how to extract online information in order to analyze, in real time, the credibility of content posted on the Web. To do so, we develop a framework that offers both data extraction alternatives and implements a previously proposed credibility model. The framework is implemented as a Google Chrome extension that analyzes tweets in real time. Results show that both methods produce identical credibility values when a robust normalization process is applied to the text (i.e., the tweet). In terms of time performance, web scraping is faster than the Twitter API and more flexible in the data it can obtain; however, it is very sensitive to website changes.
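The abstract's point about normalization can be illustrated with a minimal sketch. It does not reproduce the paper's normalization process, which the abstract does not detail. The helper name `normalize_tweet_text`, the three normalization steps, and both sample strings are assumptions chosen for illustration. The idea is that the same tweet may arrive with HTML entities from one source and with page-layout whitespace from the other, so both are reduced to one canonical form before any credibility score is computed.

```python
import html
import re
import unicodedata


def normalize_tweet_text(raw: str) -> str:
    """Reduce tweet text to a canonical form (hypothetical helper, not the paper's)."""
    text = html.unescape(raw)                   # decode entities such as "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms, e.g. NBSP -> space
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace and newlines
    return text.strip()


# Assumed sample inputs: the same tweet as it might look from each source.
api_text = "Breaking &amp; verified:  markets rally"
scraped_text = "Breaking & verified:\u00a0markets\n rally "

# After normalization both sources yield the same string, so any
# text-based credibility measure receives identical input.
assert normalize_tweet_text(api_text) == normalize_tweet_text(scraped_text)
```

Under these assumptions, a text-based credibility function fed the normalized string would return the same value regardless of which extraction method supplied the tweet.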