Prioritized active learning for malicious URL detection using weighted text-based features

S. Bhattacharjee, A. Talukder, E. Al-Shaer, Pratik Doshi
Published in: 2017 IEEE International Conference on Intelligence and Security Informatics (ISI)
Publication date: 2017-07-01
DOI: 10.1109/ISI.2017.8004883 (https://doi.org/10.1109/ISI.2017.8004883)
Citations: 14

Abstract

Data analytics is increasingly used in cyber-security problems and has proved useful where data volume and heterogeneity make manual assessment by security experts cumbersome. In practical cyber-security scenarios involving data-driven analytics, obtaining annotated data (i.e. ground-truth labels) is a challenging and well-known limiting factor for many supervised security analytics tasks. Significant portions of large datasets typically remain unlabelled, because annotation is largely manual and requires a great deal of expert intervention. In this paper, we propose an active learning approach that addresses this limitation in a practical cyber-security problem, phishing categorization, using a human-machine collaborative approach to design a semi-supervised solution. An initial classifier is learnt on a small amount of annotated data and is then updated iteratively by shortlisting, from the large pool of unlabelled data, only those samples that are most likely to influence classifier performance quickly. Prioritized Active Learning shows significant promise for faster convergence of classification performance in a batch learning framework, thus requiring less human annotation effort. A feature weight update technique combined with active learning shows promising classification performance for categorizing phishing/malicious URLs without requiring a large number of annotated training samples. In experiments with several collections of PhishMonger's Targeted Brand dataset, the proposed method improves over the baseline by as much as 12%.
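
The iterative loop described in the abstract (train on a small labelled seed, shortlist a batch from the unlabelled pool, have a human label it, retrain) can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the prioritization criterion or the feature weight update, so the sketch assumes plain uncertainty sampling and a small logistic model over URL tokens. All URLs, labels, function names, and parameters below are hypothetical.

```python
import math
import re
from collections import defaultdict

def tokenize(url):
    """Split a URL into lowercase alphanumeric tokens (simple text-based features)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def predict_proba(weights, url):
    """Logistic score: estimated probability that the URL is malicious."""
    z = weights["__bias__"] + sum(weights[t] for t in tokenize(url))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(labeled, epochs=200, lr=0.5):
    """Fit per-token weights with stochastic gradient steps on the log-loss."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for url, y in labeled:
            err = y - predict_proba(weights, url)
            weights["__bias__"] += lr * err
            for t in tokenize(url):
                weights[t] += lr * err
    return weights

def select_batch(weights, pool, batch_size):
    """Assumed prioritization: most uncertain URLs (probability nearest 0.5) first."""
    ranked = sorted(pool, key=lambda u: abs(predict_proba(weights, u) - 0.5))
    return ranked[:batch_size]

def active_learn(seed, pool, oracle, rounds=2, batch_size=2):
    """Batch active learning: query the oracle only for shortlisted samples."""
    labeled, pool = list(seed), list(pool)
    weights = train(labeled)
    for _ in range(rounds):
        if not pool:
            break
        batch = select_batch(weights, pool, batch_size)
        labeled += [(u, oracle(u)) for u in batch]  # human annotation step
        pool = [u for u in pool if u not in batch]
        weights = train(labeled)
    return weights, labeled

# Hypothetical toy data; 1 = phishing/malicious, 0 = benign.
seed = [
    ("http://bank-login.secure-update.example/verify", 1),
    ("https://www.example.org/docs/index", 0),
]
truth = {
    "http://secure-update.example/login/confirm": 1,
    "https://www.example.org/blog/news": 0,
    "http://verify-account.example/bank/signin": 1,
    "https://docs.example.org/guide/install": 0,
}
pool = list(truth)

# The dictionary lookup stands in for the human expert (the oracle).
weights, labeled = active_learn(seed, pool, oracle=truth.__getitem__)
for url in pool:
    print(f"{predict_proba(weights, url):.2f}  {url}")
```

In this toy run the pool is small enough that every URL is eventually queried; with a realistic pool, `rounds` and `batch_size` bound how many labels the expert has to provide.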