Identifying Sensitive URLs at Web-Scale

S. Matic, Costas Iordanou, Georgios Smaragdakis, Nikolaos Laoutaris
{"title":"Identifying Sensitive URLs at Web-Scale","authors":"S. Matic, Costas Iordanou, Georgios Smaragdakis, Nikolaos Laoutaris","doi":"10.1145/3419394.3423653","DOIUrl":null,"url":null,"abstract":"Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 Billion URLs collected by the Common Crawl project. We identify more than 155 millions sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third-parties set at least one persistent cookie.","PeriodicalId":255324,"journal":{"name":"Proceedings of the ACM Internet Measurement Conference","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM Internet Measurement Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3419394.3423653","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 15

Abstract

Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 Billion URLs collected by the Common Crawl project. We identify more than 155 millions sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third-parties set at least one persistent cookie.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在web级识别敏感url
一些数据保护法包括保护与宗教、健康、性取向和其他敏感类别有关的个人数据的特别条款。有一个明确定义的敏感类别清单就足以手动提交投诉、进行调查和在法庭上起诉案件。然而,数据保护法并没有明确定义每个敏感类别下的内容类型。因此,目前尚不清楚如何实施主动措施,如通知用户,阻止跟踪器,并在用户访问敏感域名时自动提交投诉。为了支持这样的用例,我们转向Curlie.org众包分类法项目,用于绘制训练数据,为敏感url构建文本分类器。我们证明了我们的分类器能够以88%以上的准确率识别敏感的url,甚至能够以90%以上的准确率识别特定的敏感类别。然后,我们使用我们的分类器在Common Crawl项目收集的10亿个url语料库中搜索敏感url。我们在超过400万个域名中识别超过1.55亿个敏感url。尽管它们具有敏感性,但超过30%的这些url属于未使用HTTPS的域名。此外,在使用第三方cookie的敏感网页中,87%的第三方设置了至少一个持久cookie。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Lumos5G A Bird's Eye View of the World's Fastest Networks Quantifying the Impact of Blocklisting in the Age of Address Reuse TopoScope No WAN's Land: Mapping U.S. Broadband Coverage with Millions of Address Queries to ISPs
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1