On the Ubiquity of Web Tracking: Insights from a Billion-Page Web Crawl

J. Web Sci. | Pub Date: 2016-07-25 | DOI: 10.1561/106.00000014
Sebastian Schelter, Jérôme Kunegis
Citations: 38

Abstract

We perform a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus, and aggregate them into a dataset containing more than 140 million third-party embeddings in over 41 million domains. To the best of our knowledge, this constitutes the largest empirical web tracking dataset collected so far, and exceeds related studies by more than an order of magnitude in the number of domains and web pages analysed. Due to the enormous size of the dataset, we are able to perform a large-scale study of online tracking on three levels: (1) On a global level, we give a precise figure for the extent of tracking, give insights into the structural properties of the 'online tracking sphere', and analyse which trackers (and subsequently, which companies) are used by how many websites. (2) On a country-specific level, we analyse which trackers are used by websites in different countries, and identify the countries in which websites choose significantly different trackers than in the rest of the world. (3) We answer the question of whether the content of websites influences the choice of trackers they use, leveraging more than ninety thousand categorized domains. In particular, we analyse whether highly privacy-critical websites about health and addiction make different choices of trackers than other websites. Based on these analyses, we confirm that trackers are widespread (as expected), and that a small number of trackers dominate the web (Google, Facebook and Twitter). In particular, the three tracking domains with the highest PageRank are all owned by Google. The only exceptions to this pattern are a few countries such as China and Russia. Our results suggest that this dominance is strongly associated with country-specific political factors such as freedom of the press. Furthermore, our data confirm that Google still operates services on Chinese websites, despite its proclaimed retreat from the Chinese market. We also confirm that websites with highly privacy-critical content are less likely to contain trackers (60% vs 90% for other websites), even though the majority of them still do contain trackers.
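
The extraction and aggregation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes that an "embedding" is a `src` attribute on a `script`, `img` or `iframe` tag, and that two hosts belong to the same site when their last two hostname labels match (a crude stand-in for a public-suffix lookup). The names `naive_site`, `EmbedCollector`, `third_party_embeddings` and `aggregate` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


def naive_site(host):
    """Collapse a hostname to its last two labels.

    Assumption for illustration only: this misclassifies suffixes such as
    'co.uk'; a real analysis would need a public-suffix list.
    """
    return ".".join(host.lower().split(".")[-2:])


class EmbedCollector(HTMLParser):
    """Collects hosts referenced by the src attribute of script, img and iframe tags."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "img", "iframe"):
            src = dict(attrs).get("src")
            if src:
                # Relative URLs have no hostname and are first-party by definition.
                host = urlparse(src).hostname
                if host:
                    self.hosts.add(host)


def third_party_embeddings(page_url, html):
    """Return the set of embedded sites that differ from the page's own site."""
    first_party = naive_site(urlparse(page_url).hostname)
    collector = EmbedCollector()
    collector.feed(html)
    return {naive_site(h) for h in collector.hosts if naive_site(h) != first_party}


def aggregate(pages):
    """Merge per-page results into a mapping: embedding site -> set of third-party sites."""
    result = defaultdict(set)
    for url, html in pages:
        result[naive_site(urlparse(url).hostname)] |= third_party_embeddings(url, html)
    return dict(result)
```

Calling `aggregate` on an iterable of `(url, html)` pairs yields one entry per embedding site, so many pages of the same site collapse into a single set of third-party domains. This mirrors, in spirit, how billions of pages could reduce to a much smaller domain-level dataset.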