Improving Phishing Website Detection using a Hybrid Two-level Framework for Feature Selection and XGBoost Tuning

IF 1 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING | Journal of Web Engineering | Pub Date: 2023-03-01 | DOI: 10.13052/jwe1540-9589.2237

Luka Jovanovic;Dijana Jovanovic;Milos Antonijevic;Bosko Nikolic;Nebojsa Bacanin;Miodrag Zivkovic;Ivana Strumberger

Journal of Web Engineering, volume 22 (3), pages 543-574. Full text: https://ieeexplore.ieee.org/document/10247501/

Citations: 1

Abstract

In the last few decades, the World Wide Web has become a necessity that offers numerous services to end users. The number of online transactions increases daily, as does the number of malicious actors. Machine learning plays a vital role in the majority of modern solutions. To further improve Web security, this paper proposes a hybrid approach based on the eXtreme Gradient Boosting (XGBoost) machine learning model, optimized by an improved version of a well-known metaheuristic, the firefly algorithm. The improved firefly algorithm is employed in a two-tier framework, also developed as part of this research, to perform both feature selection and tuning of the XGBoost hyper-parameters. The hybrid model is evaluated on three well-known, publicly available phishing website datasets: the first two were provided by Mendeley Data, while the third was acquired from the University of California, Irvine machine learning repository. The proposed algorithm is additionally compared against cutting-edge metaheuristics utilized in the same framework. Finally, the best performing models were subjected to SHapley Additive exPlanations (SHAP) analysis to determine the impact of each feature on model decisions. The obtained results suggest that the proposed hybrid solution outperforms the other approaches and that it represents a promising solution in the domain of web security.
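The abstract does not spell out how one optimizer can handle feature selection and hyper-parameter tuning together, so the following is only a rough sketch of one common way to encode such a search. It uses the basic firefly algorithm, not the paper's improved variant, and replaces XGBoost training with a made-up stand-in fitness function. The parameter ranges, the feature count, the `useful` target mask, and all function names are assumptions for illustration; only `learning_rate` and `max_depth` are real XGBoost parameter names.

```python
import math
import random

# Hypothetical search space: real XGBoost parameter names, assumed ranges.
PARAM_BOUNDS = {"learning_rate": (0.1, 0.9), "max_depth": (3, 10)}
N_FEATURES = 8  # assumed number of candidate features


def decode(position):
    """Split one flat vector in [0, 1]^d into a feature mask and a parameter dict."""
    mask = [x > 0.5 for x in position[:N_FEATURES]]
    params = {}
    for value, (name, (lo, hi)) in zip(position[N_FEATURES:], PARAM_BOUNDS.items()):
        scaled = lo + value * (hi - lo)
        params[name] = round(scaled) if name == "max_depth" else scaled
    return mask, params


def toy_fitness(position):
    """Stand-in for classifier error (lower is better); NOT a real XGBoost run."""
    mask, params = decode(position)
    useful = [True, True, False, True, False, False, True, False]  # made-up target
    mismatches = sum(m != u for m, u in zip(mask, useful))
    return (mismatches
            + abs(params["learning_rate"] - 0.3)
            + abs(params["max_depth"] - 6) / 10)


def firefly_search(fitness, dim, n_fireflies=10, n_iter=30,
                   alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Basic firefly algorithm on the unit hypercube, minimizing `fitness`."""
    rng = random.Random(seed)
    swarm = [[rng.random() for _ in range(dim)] for _ in range(n_fireflies)]
    scores = [fitness(p) for p in swarm]
    best_i = min(range(n_fireflies), key=scores.__getitem__)
    best, best_score = list(swarm[best_i]), scores[best_i]
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if scores[j] < scores[i]:  # j is brighter, so i moves toward j
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # attraction decays with distance
                    swarm[i] = [
                        min(1.0, max(0.0, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    scores[i] = fitness(swarm[i])
                    if scores[i] < best_score:
                        best, best_score = list(swarm[i]), scores[i]
        alpha *= 0.97  # gradually reduce the random step size
    return best, best_score


if __name__ == "__main__":
    dim = N_FEATURES + len(PARAM_BOUNDS)
    best, score = firefly_search(toy_fitness, dim)
    print(decode(best), score)
```

In a real pipeline, `toy_fitness` would be replaced by a function that trains an XGBoost classifier on the selected feature columns with the decoded parameters and returns its validation error. The single flat vector is a design choice made here for brevity; the paper's two-tier framework may organize the two searches differently.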


Source journal

Journal of Web Engineering (Engineering & Technology; Computer Science: Theory & Methods)

CiteScore: 1.80

Self-citation rate: 12.50%

Review time: 9 months

Journal description: The World Wide Web and its associated technologies have become a major implementation and delivery platform for a large variety of applications, ranging from simple institutional information Web sites to sophisticated supply-chain management systems, financial applications, e-government, distance learning, and entertainment, among others. Such applications, in addition to their intrinsic functionality, also exhibit the more complex behavior of distributed applications.