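Quantifying the Systematic Bias in the Accessibility and Inaccessibility of Web Scraping Content From URL-Logged Web-Browsing Digital Trace Data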

Ross Dahlke, Deepak Kumar, Z. Durumeric, Jeffrey T. Hancock
Social Science Computer Review. Published 2023-11-29. DOI: 10.1177/08944393231218214 (https://doi.org/10.1177/08944393231218214)
Citations: 0

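Abstract

Social scientists and computer scientists increasingly use observational digital trace data, analyzing these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call this often unstated assumption the problematic assumption of accessibility. To examine the extent to which this assumption may be problematic, we identify 107k hard news and misinformation web pages visited by a representative panel of 1,238 American adults and record the degree to which the web pages individuals visited were accessible via successful web scrapes or inaccessible via unsuccessful scrapes. While we find that the collected URLs are largely accessible and their content unrestricted, we find systematic biases in which URLs are restricted, return an error, or are inaccessible. For example, conservative misinformation URLs are more likely to be inaccessible than other types of misinformation. We suggest how social scientists should capture and report digital trace and web scraping data.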
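The kind of bookkeeping the abstract describes, recording each scrape outcome and comparing outcome rates across content types, might be sketched as follows. This is a minimal illustration and not the authors' pipeline: the `ScrapeResult` fields, the `paywalled` flag, the group labels, and the rule mapping HTTP status codes to the four outcome labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple, Optional


class ScrapeResult(NamedTuple):
    """One scrape attempt (hypothetical record layout)."""
    url: str
    group: str               # assumed content label, e.g. "hard_news"
    status: Optional[int]    # HTTP status code, or None if no response arrived
    paywalled: bool = False  # assumed flag set upstream by the scraper


def classify(result: ScrapeResult) -> str:
    """Map one scrape attempt to an outcome label (assumed scheme)."""
    if result.status is None:
        return "inaccessible"  # e.g. DNS failure or timeout
    if result.status >= 400:
        return "error"         # client or server error response
    if result.paywalled:
        return "restricted"    # page loaded, but content is gated
    return "accessible"


def outcome_rates(results: Iterable[ScrapeResult]) -> dict:
    """Return, for each group, the share of URLs with each outcome."""
    counts = defaultdict(Counter)
    for r in results:
        counts[r.group][classify(r)] += 1
    return {
        group: {label: n / sum(c.values()) for label, n in c.items()}
        for group, c in counts.items()
    }


# Made-up sample records, for illustration only.
sample = [
    ScrapeResult("https://example.com/a", "hard_news", 200),
    ScrapeResult("https://example.com/b", "hard_news", 200, paywalled=True),
    ScrapeResult("https://example.org/c", "misinformation", 404),
    ScrapeResult("https://example.org/d", "misinformation", None),
]
rates = outcome_rates(sample)
print(rates)
```

Reporting per-group outcome shares like these alongside the scraped content would make any gap between groups visible, rather than leaving the accessibility assumption unstated.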