CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

ACM Transactions on Information and System Security. Pub Date: 2011-09-01. DOI: 10.1145/2019599.2019606

Guang Xiang, Jason I. Hong, C. Rosé, L. Cranor


Citations: 480

Abstract

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is frail in terms of new phish, and the latter suffers from the scarcity of effective features and the high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms.

Specifically, we proposed CANTINA+, the most comprehensive feature-based approach in the literature including eight novel features, which exploits the HTML Document Object Model (DOM), search engines and third party services with machine learning techniques to detect phish. Moreover, we designed two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate.

We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish and over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, our CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
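The layered design described in the abstract (a hash-based near-duplicate filter, then a login form filter, then a learned classifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The whitespace-stripping normalization, the choice of SHA-1, the "password input inside a form" heuristic, and the names `page_fingerprint`, `has_login_form`, `classify`, and the `model` callable are all assumptions made for illustration.

```python
import hashlib
import re
from html.parser import HTMLParser


def page_fingerprint(html: str) -> str:
    """Hash the page after removing all whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", "", html)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class LoginFormFinder(HTMLParser):
    """Simplified heuristic: a <form> containing a password <input>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            # Attribute values can be None when written without a value.
            if (dict(attrs).get("type") or "").lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def has_login_form(html: str) -> bool:
    finder = LoginFormFinder()
    finder.feed(html)
    return finder.found


def classify(html, known_phish_hashes, model):
    """Layered decision; `model` is a hypothetical stand-in for the
    trained classifier and returns True when it predicts phish."""
    # Filter 1: near-duplicate of a known phish, decided by hash lookup.
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"
    # Filter 2: pages with no identified login form are labeled legitimate.
    if not has_login_form(html):
        return "legitimate"
    # Otherwise defer to the feature-based classifier.
    return "phish" if model(html) else "legitimate"
```

Both filters run before the classifier, so pages they decide never reach the more expensive feature extraction. This ordering is one way to obtain the runtime speedup the abstract mentions.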



Source journal

ACM Transactions on Information and System Security (Engineering & Technology: Computer Science, Information Systems)

CiteScore

4.50

Self-citation rate

0.00%

Articles published

Review time

3.3 months

Journal description: TISSEC is a scholarly, scientific journal that publishes original research papers in all areas of information and system security, including technologies, systems, applications, and policies.

Latest articles from this journal

An Efficient User Verification System Using Angle-Based Mouse Movement Biometrics

A New Framework for Privacy-Preserving Aggregation of Time-Series Data

Behavioral Study of Users When Interacting with Active Honeytokens

Model Checking Distributed Mandatory Access Control Policies

Randomization-Based Intrusion Detection System for Advanced Metering Infrastructure*