Machine Learning Approach for Phishing Attack Detection

Journal of Artificial Intelligence and Technology  Pub Date: 2023-05-10  DOI: 10.37965/jait.2023.0197

Tarun Choudhary, Siddhesh Mhapankar, Rohit Bhddha, Ashish Kharuk, Rohini Patil



Abstract

Phishing is the easiest method for gathering sensitive information from unwary people. Phishers seek private data such as passwords, login information, and bank account details. Cybersecurity experts are actively looking for trustworthy and effective ways to identify phishing websites. To distinguish legitimate URLs from phishing URLs, we used machine learning (ML) to extract and analyze both types of URLs. Extreme Gradient Boosting (XGBoost), Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM) classifiers were used to identify phishing websites. The goal was to identify phishing URLs and to determine the most effective ML technique by comparing the accuracy of each algorithm. Two datasets were used in the proposed methodology. Model accuracy was calculated on the Phishtank and UCI datasets using k-fold cross-validation, feature selection, and hyperparameter tuning. Precision, recall, F1-score, and the Receiver Operating Characteristic (ROC) curve were also calculated. RF achieved an accuracy of 98.80% on the Phishtank dataset and 97.87% on the UCI dataset. The highest precision, recall, and F1-score values were 99% each, and the AUC-ROC value was 99.89%, all on the Phishtank dataset. Compared with results reported by other researchers, the proposed methodology performed better. This methodology can therefore help identify phishing websites.
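The evaluation protocol named in the abstract (k-fold splitting followed by precision, recall, and F1-score) can be illustrated with a minimal sketch. It uses only the Python standard library. The function names `k_fold_indices` and `precision_recall_f1`, the convention that label 1 means "phishing", and the toy labels are assumptions made for illustration; the abstract does not state which tools, fold count, or label encoding the authors used.

```python
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, test) index folds.

    Hypothetical helper: no shuffling or stratification is applied here.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels (assumed encoding: 1 = phishing, 0 = legitimate).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(precision_recall_f1(y_true, y_pred))

    # Each sample appears in exactly one test fold.
    for train_idx, test_idx in k_fold_indices(len(y_true), 4):
        print(test_idx)
```

In a full experiment, each fold's training indices would be used to fit one of the five classifiers and the test indices to score it, with the per-fold metrics averaged. The sketch leaves out model fitting, feature selection, and hyperparameter tuning.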

