Exploring the Long Tail of (Malicious) Software Downloads

Babak Rahbarinia, Marco Balduzzi, R. Perdisci
{"title":"Exploring the Long Tail of (Malicious) Software Downloads","authors":"Babak Rahbarinia, Marco Balduzzi, R. Perdisci","doi":"10.1109/DSN.2017.19","DOIUrl":null,"url":null,"abstract":"In this paper, we present a large-scale study of global trends in software download events, with an analysis of both benign and malicious downloads, and a categorization of events for which no ground truth is currently available. Our measurement study is based on a unique, real-world dataset collected at Trend Micro containing more than 3 million in-the-wild web-based software download events involving hundreds of thousands of Internet machines, collected over a period of seven months. Somewhat surprisingly, we found that despite our best efforts and the use of multiple sources of ground truth, more than 83% of all downloaded software files remain unknown, i.e. cannot be classified as benign or malicious, even two years after they were first observed. If we consider the number of machines that have downloaded at least one unknown file, we find that more than 69% of the entire machine/user population downloaded one or more unknown software file. Because the accuracy of malware detection systems reported in the academic literature is typically assessed only over software files that can be labeled, our findings raise concerns on their actual effectiveness in large-scale real-world deployments, and on their ability to defend the majority of Internet machines from infection. To better understand what these unknown software files may be, we perform a detailed analysis of their properties. We then explore whether it is possible to extend the labeling of software downloads by building a rule-based system that automatically learns from the available ground truth and can be used to identify many more benign and malicious files with very high confidence. This allows us to greatly expand the number of software files that can be labeled with high confidence, thus providing results that can benefit the evaluation of future malware detection systems.","PeriodicalId":426928,"journal":{"name":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2017.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13

Abstract

In this paper, we present a large-scale study of global trends in software download events, with an analysis of both benign and malicious downloads, and a categorization of events for which no ground truth is currently available. Our measurement study is based on a unique, real-world dataset collected at Trend Micro containing more than 3 million in-the-wild web-based software download events involving hundreds of thousands of Internet machines, collected over a period of seven months. Somewhat surprisingly, we found that despite our best efforts and the use of multiple sources of ground truth, more than 83% of all downloaded software files remain unknown, i.e. cannot be classified as benign or malicious, even two years after they were first observed. If we consider the number of machines that have downloaded at least one unknown file, we find that more than 69% of the entire machine/user population downloaded one or more unknown software file. Because the accuracy of malware detection systems reported in the academic literature is typically assessed only over software files that can be labeled, our findings raise concerns on their actual effectiveness in large-scale real-world deployments, and on their ability to defend the majority of Internet machines from infection. To better understand what these unknown software files may be, we perform a detailed analysis of their properties. We then explore whether it is possible to extend the labeling of software downloads by building a rule-based system that automatically learns from the available ground truth and can be used to identify many more benign and malicious files with very high confidence. This allows us to greatly expand the number of software files that can be labeled with high confidence, thus providing results that can benefit the evaluation of future malware detection systems.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
探索(恶意)软件下载的长尾
在本文中,我们对软件下载事件的全球趋势进行了大规模研究,对良性和恶意下载进行了分析,并对目前没有真实情况的事件进行了分类。我们的测量研究是基于趋势科技收集的一个独特的、真实的数据集,其中包含超过300万的基于网络的软件下载事件,涉及数十万台互联网机器,收集时间为7个月。有些令人惊讶的是,我们发现,尽管我们尽了最大的努力,并使用了多种来源的地面真相,但超过83%的下载软件文件仍然未知,即无法归类为良性或恶意,甚至在它们首次被发现两年后。如果我们考虑至少下载了一个未知文件的机器数量,我们发现超过69%的整个机器/用户群体下载了一个或多个未知软件文件。因为在学术文献中报告的恶意软件检测系统的准确性通常只对可以标记的软件文件进行评估,我们的研究结果引起了人们对它们在大规模现实世界部署中的实际有效性的关注,以及它们保护大多数互联网机器免受感染的能力。为了更好地了解这些未知的软件文件可能是什么,我们对它们的属性进行了详细的分析。然后,我们探索是否有可能通过建立一个基于规则的系统来扩展软件下载的标签,该系统可以自动从可用的基础事实中学习,并且可以非常高的置信度用于识别更多的良性和恶意文件。这使我们能够极大地扩展可以高置信度标记的软件文件的数量,从而提供有利于评估未来恶意软件检测系统的结果。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Compromising Security of Economic Dispatch in Power System Operations Implicit Smartphone User Authentication with Sensors and Contextual Machine Learning Towards Automated Discovery of Crash-Resistant Primitives in Binary Executables Sensor-Based Implicit Authentication of Smartphone Users Athena: A Framework for Scalable Anomaly Detection in Software-Defined Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1