A Comparison of Natural Language Processing and Machine Learning Methods for Phishing Email Detection

Panagiotis Bountakas, Konstantinos Koutroumpouchos, C. Xenakis
Published in: Proceedings of the 16th International Conference on Availability, Reliability and Security
Publication date: 2021-08-16
DOI: 10.1145/3465481.3469205 (https://doi.org/10.1145/3465481.3469205)
Citations: 12

Abstract

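Phishing is the most widely used malicious attack, in which attackers, commonly via email, impersonate trusted persons or entities to obtain private information from a victim. Although phishing email attacks have been a known cybercriminal strategy for decades, their use has expanded over the last couple of years because of the COVID-19 pandemic, during which attackers exploited people's anxiety to lure victims. Further research is therefore needed in the field of phishing email detection. Recent detection solutions that extract representative text-based features from the email body have proved to be an appropriate strategy for tackling these threats. This paper compares combinations of Natural Language Processing methods (TF-IDF, Word2Vec, and BERT) and Machine Learning methods (Random Forest, Decision Tree, Logistic Regression, Gradient Boosting Trees, and Naive Bayes) for phishing email detection. The evaluation was performed on two datasets, one balanced and one imbalanced, both composed of emails from the well-known Enron corpus and the most recent emails from the Nazario phishing corpus. The best combination on the balanced dataset was Word2Vec with Random Forest, while on the imbalanced dataset it was Word2Vec with Logistic Regression.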
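To make the idea of pairing a text representation with a classifier concrete, here is a minimal sketch of one of the compared combinations, TF-IDF features fed to a Naive Bayes classifier, written with the Python standard library only. It is not the authors' implementation: the helper names (`tokenize`, `fit_idf`, `tfidf`, `NaiveBayes`), the smoothed IDF formula, and the six toy emails are all assumptions made for illustration, and no text from the Enron or Nazario corpora is used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, keep purely alphabetic tokens."""
    return [t for t in text.lower().split() if t.isalpha()]


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """Map a token list to {term: tf * idf}; terms unseen in training are ignored."""
    tf = Counter(t for t in doc if t in idf)
    return {term: count * idf[term] for term, count in tf.items()}


class NaiveBayes:
    """Multinomial Naive Bayes over (possibly fractional) TF-IDF weights."""

    def fit(self, vectors, labels, vocab, alpha=1.0):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_like = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            # Total TF-IDF weight per term within this class.
            totals = defaultdict(float)
            for v in rows:
                for term, w in v.items():
                    totals[term] += w
            # Laplace smoothing so terms absent from a class keep nonzero mass.
            denom = sum(totals.values()) + alpha * len(vocab)
            self.log_like[c] = {
                t: math.log((totals.get(t, 0.0) + alpha) / denom) for t in vocab
            }
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in vector.items()
            )
        return max(self.classes, key=score)


# Invented toy emails (assumption): three phishing-like, three legitimate.
train = [
    ("verify your account password now", "phishing"),
    ("urgent click link to confirm your bank account", "phishing"),
    ("your account is suspended click to verify", "phishing"),
    ("meeting agenda for monday attached", "ham"),
    ("please review the quarterly report draft", "ham"),
    ("lunch on friday with the trading team", "ham"),
]

docs = [tokenize(text) for text, _ in train]
labels = [label for _, label in train]
idf = fit_idf(docs)
model = NaiveBayes().fit([tfidf(d, idf) for d in docs], labels, vocab=list(idf))


def classify(text):
    """Run the full pipeline: tokenize, vectorize, predict."""
    return model.predict(tfidf(tokenize(text), idf))


print(classify("click to verify your password"))
print(classify("review the meeting agenda"))
```

Swapping the feature step (for example, averaged Word2Vec vectors) or the classifier (for example, Random Forest) while keeping the rest of the pipeline fixed is the kind of grid the paper evaluates; a real comparison would use held-out data and metrics suited to the imbalanced case rather than two hand-picked test sentences.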