Spam Email Detection Using Continuous Bag-of-Words and Random Forest

Ranah Research : Journal of Multidisciplinary Research and Development | Pub Date: 2024-06-05 | DOI: 10.38035/rrj.v6i4.873

Michiavelly Rustam, Agung Brotokuncoro, Rusdianto Roestam

Citations: 0

Abstract

Original title: Deteksi Email Spam dengan Continuous Bag-Of-Words dan Random Forest

Spam email poses a significant cyber threat: scammers use a range of tactics to deceive individuals into divulging sensitive information or downloading harmful content. In June 2023, for instance, Indonesia encountered approximately 6.51 thousand spam attacks, underscoring how widespread the issue is. These attacks frequently rely on deceptive strategies, such as impersonation or false promises of rewards, to ensnare unsuspecting victims, and succumbing to them can result in financial losses and other grave repercussions. To address this problem, this research focuses on classifying email content to detect phishing attempts. The proposed solution runs on a runtime platform such as Google Colab and combines Continuous Bag of Words (CBOW) analysis with the Random Forest method. CBOW is selected for its effectiveness in capturing semantic relationships between words, which allows the model to extract meaningful features from the email content. Random Forest is chosen for its ability to handle the imbalanced datasets commonly encountered in email classification tasks, so that both spam and ham emails are fairly represented during model training. By combining these two techniques, we aim to develop a robust classification model capable of accurately distinguishing between phishing (spam) and legitimate (ham) emails, thus enhancing email security measures. We apply the approach to classify the SpamAssassin dataset into ham and spam categories, with an anticipated precision of 0.98, which would demonstrate the model's effectiveness in identifying phishing emails.
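The abstract does not give implementation details, so the following is only a minimal sketch of how a CBOW-plus-Random-Forest pipeline could be assembled, not the authors' code. It assumes a tiny hand-written toy corpus in place of the SpamAssassin dataset, a small NumPy CBOW trainer with a full softmax, email vectors formed by averaging word embeddings, and scikit-learn's `RandomForestClassifier` with `class_weight="balanced"` as one possible way to compensate for class imbalance. All names, data, and hyperparameters here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy corpus (hypothetical, NOT the SpamAssassin data): 1 = spam, 0 = ham.
EMAILS = [
    ("win a free prize now claim your reward", 1),
    ("urgent verify your account to claim free reward", 1),
    ("free money click now to win a prize", 1),
    ("meeting moved to monday please see the agenda", 0),
    ("please review the attached project report", 0),
    ("lunch on monday to discuss the project agenda", 0),
]


def train_cbow(docs, dim=16, window=2, epochs=50, lr=0.05, seed=0):
    """Train a tiny CBOW model with a full softmax; returns (vocab, embeddings)."""
    rng = np.random.default_rng(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    w_in = rng.normal(0.0, 0.1, (len(vocab), dim))   # input (context) embeddings
    w_out = rng.normal(0.0, 0.1, (dim, len(vocab)))  # output (target) weights
    for _ in range(epochs):
        for doc in docs:
            ids = [vocab[w] for w in doc]
            for pos, target in enumerate(ids):
                ctx = ids[max(0, pos - window):pos] + ids[pos + 1:pos + 1 + window]
                if not ctx:
                    continue
                hidden = w_in[ctx].mean(axis=0)      # average the context vectors
                scores = hidden @ w_out
                probs = np.exp(scores - scores.max())
                probs /= probs.sum()                 # softmax over the vocabulary
                probs[target] -= 1.0                 # gradient of the cross-entropy loss
                grad_hidden = w_out @ probs          # computed before w_out changes
                w_out -= lr * np.outer(hidden, probs)
                # ufunc.at accumulates correctly when a word repeats in the context.
                np.subtract.at(w_in, ctx, lr * grad_hidden / len(ctx))
    return vocab, w_in


def embed(doc, vocab, vectors):
    """Represent an email as the mean of its known word vectors."""
    ids = [vocab[w] for w in doc if w in vocab]
    if not ids:
        return np.zeros(vectors.shape[1])
    return vectors[ids].mean(axis=0)


docs = [text.split() for text, _ in EMAILS]
labels = np.array([label for _, label in EMAILS])
vocab, vectors = train_cbow(docs)
features = np.vstack([embed(d, vocab, vectors) for d in docs])

# class_weight="balanced" is an assumption: one common way to offset class imbalance.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
clf.fit(features, labels)

new_email = "claim your free prize now".split()
prediction = clf.predict(embed(new_email, vocab, vectors).reshape(1, -1))
print("spam" if prediction[0] == 1 else "ham")
```

On a real corpus, a library implementation of CBOW and a held-out test split for measuring precision would likely be more appropriate than this hand-rolled trainer; the sketch only shows how the two stages connect.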
