{"title":"Classification of Phishing Email Using Word Embedding and Machine Learning Techniques","authors":"M. Somesha, A. R. Pais","doi":"10.13052/jcsm2245-1439.1131","DOIUrl":null,"url":null,"abstract":"Email phishing is a cyber-attack, bringing substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spamming, used to trick the user to disclose personal information to access his digital assets. Phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that the existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails to provide better security against such attacks to the common user. The existing open-source data sets are limited in diversity, hence they do not capture the real picture of the attack. Hence there is a need for real-time input data set to design accurate email anti-phishing solutions. In the current work, it has been created a real-time in-house corpus of phishing and legitimate emails and proposed efficient techniques to detect phishing emails using a word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all six combinations, Random Forest consistently performed the best with FastText (CBOW) by achieving an accuracy of 99.50% with a false positive rate of 0.053%, TF-IDF achieved an accuracy of 99.39% with a false positive rate of 0.4% and Count Vectorizer achieved an accuracy of 99.18% with a false positive rate of 0.98% respectively for three datasets used.","PeriodicalId":37820,"journal":{"name":"Journal of Cyber Security and Mobility","volume":"39 1","pages":"279-320"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cyber Security and Mobility","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.13052/jcsm2245-1439.1131","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 6
Abstract
Email phishing is a cyber-attack, bringing substantial financial damage to corporate and commercial organizations. A phishing email is a special type of spamming, used to trick the user to disclose personal information to access his digital assets. Phishing attack is generally triggered by emailing links to spoofed websites that collect sensitive information. The APWG survey suggests that the existing countermeasures remain ineffective and insufficient for detecting phishing attacks. Hence there is a need for an efficient mechanism to detect phishing emails to provide better security against such attacks to the common user. The existing open-source data sets are limited in diversity, hence they do not capture the real picture of the attack. Hence there is a need for real-time input data set to design accurate email anti-phishing solutions. In the current work, it has been created a real-time in-house corpus of phishing and legitimate emails and proposed efficient techniques to detect phishing emails using a word embedding and machine learning algorithms. The proposed system uses only four email header-based heuristics for the classification of emails. The proposed word embedding cum machine learning framework comprises six word embedding techniques with five machine learning classifiers to evaluate the best performing combination. Among all six combinations, Random Forest consistently performed the best with FastText (CBOW) by achieving an accuracy of 99.50% with a false positive rate of 0.053%, TF-IDF achieved an accuracy of 99.39% with a false positive rate of 0.4% and Count Vectorizer achieved an accuracy of 99.18% with a false positive rate of 0.98% respectively for three datasets used.
期刊介绍:
Journal of Cyber Security and Mobility is an international, open-access, peer reviewed journal publishing original research, review/survey, and tutorial papers on all cyber security fields including information, computer & network security, cryptography, digital forensics etc. but also interdisciplinary articles that cover privacy, ethical, legal, economical aspects of cyber security or emerging solutions drawn from other branches of science, for example, nature-inspired. The journal aims at becoming an international source of innovation and an essential reading for IT security professionals around the world by providing an in-depth and holistic view on all security spectrum and solutions ranging from practical to theoretical. Its goal is to bring together researchers and practitioners dealing with the diverse fields of cybersecurity and to cover topics that are equally valuable for professionals as well as for those new in the field from all sectors industry, commerce and academia. This journal covers diverse security issues in cyber space and solutions thereof. As cyber space has moved towards the wireless/mobile world, issues in wireless/mobile communications and those involving mobility aspects will also be published.