A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ

Razieh Nokhbeh Zaeem, K. S. Barber
Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy, 2021-04-26
DOI: 10.1145/3422337.3447827 (https://doi.org/10.1145/3422337.3447827)
Citations: 13

Abstract

Studies have shown that website privacy policies are too long and too hard for their target audience to comprehend. These studies, and a more recent body of research that uses machine learning and natural language processing to summarize privacy policies automatically, benefit greatly from, if not rely on, corpora of privacy policies collected from the web. Although smaller annotated corpora of web privacy policies have been made public, we are not aware of any large publicly available corpus. We use DMOZ, a massive open-content directory of the web, and its 1.5 million manually categorized websites to collect hundreds of thousands of privacy policies together with their categories, enabling research on privacy policies across different categories and market sectors. We review the statistics of this corpus and make it available for research. We also obtain valuable insights about privacy policies, e.g., which websites post them less often. Our corpus gives researchers a valuable tool for investigating privacy policies. For example, it provides a benchmark for comparing different methods of privacy policy summarization, and it can be used in unsupervised machine learning to summarize privacy policies.
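As an illustration of the kind of per-category statistic the abstract mentions (which websites post privacy policies less often), the following is a minimal sketch, not the authors' method. The record layout `(url, category, has_policy)`, the slash-separated category path, the function name `policy_rate_by_category`, and the sample rows are all assumptions made up for this example; they are not taken from the corpus.

```python
from collections import defaultdict


def policy_rate_by_category(records):
    """Return the fraction of sites with a privacy policy per top-level category.

    `records` is an iterable of (url, category, has_policy) tuples.
    This layout is hypothetical; the real corpus may be organized differently.
    """
    totals = defaultdict(int)
    with_policy = defaultdict(int)
    for _url, category, has_policy in records:
        # Assumption: categories are slash-separated paths; keep the first level.
        top_level = category.split("/")[0]
        totals[top_level] += 1
        if has_policy:
            with_policy[top_level] += 1
    return {cat: with_policy[cat] / totals[cat] for cat in totals}


# Made-up sample rows, for illustration only (not data from the paper).
sample = [
    ("https://shop.example", "Shopping/Clothing", True),
    ("https://store.example", "Shopping/Toys", True),
    ("https://fan.example", "Arts/Music", False),
    ("https://band.example", "Arts/Music", True),
]

rates = policy_rate_by_category(sample)
# List categories from the lowest posting rate to the highest.
for category, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{category}: {rate:.0%}")
```

Sorting by rate puts the categories that post policies least often first, which is the comparison across categories or market sectors that the abstract describes.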