Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services

Pelayo Vallina, V. Pochat, Álvaro Feal, Marius Paraschiv, Julien Gamba, Tim Burke, O. Hohlfeld, J. Tapiador, N. Vallina-Rodriguez
DOI: 10.1145/3419394.3423660 (https://doi.org/10.1145/3419394.3423660)
Venue: Proceedings of the ACM Internet Measurement Conference
Published: 2020-10-27
Citations: 24

Abstract

Domain classification services have applications in multiple areas, including cybersecurity, content blocking, and targeted advertising. Yet, these services are often a black box in terms of their methodology for classifying domains, which makes it difficult to assess their strengths, aptness for specific applications, and limitations. In this work, we perform a large-scale analysis of 13 popular domain classification services on more than 4.4M hostnames. Our study empirically explores their methodologies, scalability limitations, label constellations, and their suitability for academic research as well as other practical applications such as content filtering. We find that the coverage varies enormously across providers, ranging from over 90% to below 1%. All services deviate from their documented taxonomy, hampering sound usage for research. Further, labels are highly inconsistent across providers, who show little agreement over domains, making it difficult to compare or combine these services. We also show how the dynamics of crowd-sourced efforts may be obstructed by scalability and coverage aspects as well as subjective disagreements among human labelers. Finally, through case studies, we showcase that most services are not fit for detecting specialized content for research or content-blocking purposes. We conclude with actionable recommendations on their usage based on our empirical insights and experience. In particular, we focus on how users should handle the significant disparities observed across services both in technical solutions and in research.
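
The two quantities the abstract leans on, per-provider coverage and cross-provider label agreement, can be illustrated with a minimal sketch. This is not the authors' methodology: the provider names, hostnames, and labels below are hypothetical, and the sketch assumes labels have already been mapped onto one shared taxonomy, which the abstract suggests is hard to do in practice.

```python
from itertools import combinations

# Hypothetical data: provider -> {hostname: label}. Labels are assumed to be
# already normalized to a single shared taxonomy (an assumption, not a given).
labels = {
    "provider_a": {"example.com": "news", "shop.example": "shopping", "blog.example": "blogs"},
    "provider_b": {"example.com": "news", "shop.example": "business"},
    "provider_c": {"example.com": "media"},
}
hostnames = ["example.com", "shop.example", "blog.example", "unknown.example"]


def coverage(provider_labels, hosts):
    """Fraction of hosts for which the provider returns any label."""
    if not hosts:
        return 0.0
    return sum(1 for h in hosts if h in provider_labels) / len(hosts)


def pairwise_agreement(a, b):
    """Fraction of hosts labeled by both providers that receive the same label.

    Returns None when the two providers share no labeled host.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum(1 for h in shared if a[h] == b[h]) / len(shared)


coverages = {name: coverage(m, hostnames) for name, m in labels.items()}
agreements = {
    (p, q): pairwise_agreement(labels[p], labels[q])
    for p, q in combinations(sorted(labels), 2)
}
```

Agreement is computed only over hostnames that both providers label, so a low-coverage provider does not automatically drag the agreement score down; whether that is the right choice depends on the application.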