Classification of Phishing Email Using Word Embedding and Machine Learning Techniques

M. Somesha, A. R. Pais
{"title":"Classification of Phishing Email Using Word Embedding and Machine Learning Techniques","authors":"M. Somesha, A. R. Pais","doi":"10.13052/jcsm2245-1439.1131","DOIUrl":null,"url":null,"abstract":"Email phishing is a cyber-attack, bringing substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spamming, used to trick the user to disclose personal information to access his digital assets. Phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that the existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails to provide better security against such attacks to the common user. The existing open-source data sets are limited in diversity, hence they do not capture the real picture of the attack. Hence there is a need for real-time input data set to design accurate email anti-phishing solutions. In the current work, it has been created a real-time in-house corpus of phishing and legitimate emails and proposed efficient techniques to detect phishing emails using a word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all six combinations, Random Forest consistently performed the best with FastText (CBOW) by achieving an accuracy of 99.50% with a false positive rate of 0.053%, TF-IDF achieved an accuracy of 99.39% with a false positive rate of 0.4% and Count Vectorizer achieved an accuracy of 99.18% with a false positive rate of 0.98% respectively for three datasets used.","PeriodicalId":37820,"journal":{"name":"Journal of Cyber Security and Mobility","volume":"39 1","pages":"279-320"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cyber Security and Mobility","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.13052/jcsm2245-1439.1131","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 6

Abstract

Email phishing is a cyber-attack, bringing substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spamming, used to trick the user to disclose personal information to access his digital assets. Phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that the existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails to provide better security against such attacks to the common user. The existing open-source data sets are limited in diversity, hence they do not capture the real picture of the attack. Hence there is a need for real-time input data set to design accurate email anti-phishing solutions. In the current work, it has been created a real-time in-house corpus of phishing and legitimate emails and proposed efficient techniques to detect phishing emails using a word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all six combinations, Random Forest consistently performed the best with FastText (CBOW) by achieving an accuracy of 99.50% with a false positive rate of 0.053%, TF-IDF achieved an accuracy of 99.39% with a false positive rate of 0.4% and Count Vectorizer achieved an accuracy of 99.18% with a false positive rate of 0.98% respectively for three datasets used.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
基于词嵌入和机器学习技术的网络钓鱼邮件分类
电子邮件网络钓鱼是一种网络攻击,给企业和商业组织带来巨大的经济损失。网络钓鱼邮件是一种特殊类型的垃圾邮件,用于欺骗用户泄露个人信息以访问其数字资产。网络钓鱼攻击通常是通过发送电子邮件链接到收集敏感信息的欺骗网站来触发的。APWG的调查表明,现有的对策仍然无效,不足以检测网络钓鱼攻击。因此,需要一种有效的机制来检测网络钓鱼电子邮件,以提供更好的安全性,防止普通用户受到此类攻击。现有的开源数据集的多样性有限,因此它们不能捕捉到攻击的真实情况。因此,需要实时输入数据集来设计准确的电子邮件反网络钓鱼解决方案。在目前的工作中,它已经创建了一个实时的内部网络钓鱼和合法电子邮件语料库,并提出了使用词嵌入和机器学习算法检测网络钓鱼电子邮件的有效技术。该系统仅使用四种基于邮件标题的启发式方法对电子邮件进行分类。提出的词嵌入和机器学习框架包括六种词嵌入技术和五种机器学习分类器,以评估最佳表现组合。在所有六种组合中,Random Forest在FastText (CBOW)上的表现一直最好,在使用的三个数据集上,其准确率为99.50%,假阳性率为0.053%,TF-IDF的准确率为99.39%,假阳性率为0.4%,Count Vectorizer的准确率为99.18%,假阳性率为0.98%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Journal of Cyber Security and Mobility
Journal of Cyber Security and Mobility Computer Science-Computer Networks and Communications
CiteScore
2.30
自引率
0.00%
发文量
10
期刊介绍: Journal of Cyber Security and Mobility is an international, open-access, peer reviewed journal publishing original research, review/survey, and tutorial papers on all cyber security fields including information, computer & network security, cryptography, digital forensics etc. but also interdisciplinary articles that cover privacy, ethical, legal, economical aspects of cyber security or emerging solutions drawn from other branches of science, for example, nature-inspired. The journal aims at becoming an international source of innovation and an essential reading for IT security professionals around the world by providing an in-depth and holistic view on all security spectrum and solutions ranging from practical to theoretical. Its goal is to bring together researchers and practitioners dealing with the diverse fields of cybersecurity and to cover topics that are equally valuable for professionals as well as for those new in the field from all sectors industry, commerce and academia. This journal covers diverse security issues in cyber space and solutions thereof. As cyber space has moved towards the wireless/mobile world, issues in wireless/mobile communications and those involving mobility aspects will also be published.
期刊最新文献
Network Malware Detection Using Deep Learning Network Analysis An Efficient Intrusion Detection and Prevention System for DDOS Attack in WSN Using SS-LSACNN and TCSLR Update Algorithm of Secure Computer Database Based on Deep Belief Network Malware Cyber Threat Intelligence System for Internet of Things (IoT) Using Machine Learning Deep Learning Based Hybrid Analysis of Malware Detection and Classification: A Recent Review
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1