通过机器学习从垃圾邮件中提取文本文档来检测恶意软件

Proceedings of the 22nd ACM Symposium on Document Engineering Pub Date : 2022-09-20 DOI:10.1145/3558100.3563854

Luis Ángel Redondo-Gutierrez, Francisco Jáñez-Martino, Eduardo FIDALGO, Enrique Alegre, V. González-Castro, R. Alaíz-Rodríguez

{"title":"通过机器学习从垃圾邮件中提取文本文档来检测恶意软件","authors":"Luis Ángel Redondo-Gutierrez, Francisco Jáñez-Martino, Eduardo FIDALGO, Enrique Alegre, V. González-Castro, R. Alaíz-Rodríguez","doi":"10.1145/3558100.3563854","DOIUrl":null,"url":null,"abstract":"Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available \"Spam Email Malware Detection - 600\" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.","PeriodicalId":146244,"journal":{"name":"Proceedings of the 22nd ACM Symposium on Document Engineering","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Detecting malware using text documents extracted from spam email through machine learning\",\"authors\":\"Luis Ángel Redondo-Gutierrez, Francisco Jáñez-Martino, Eduardo FIDALGO, Enrique Alegre, V. González-Castro, R. Alaíz-Rodríguez\",\"doi\":\"10.1145/3558100.3563854\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available \\\"Spam Email Malware Detection - 600\\\" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.\",\"PeriodicalId\":146244,\"journal\":{\"name\":\"Proceedings of the 22nd ACM Symposium on Document Engineering\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 22nd ACM Symposium on Document Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3558100.3563854\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 22nd ACM Symposium on Document Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3558100.3563854","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

垃圾邮件已成为网络犯罪分子传播恶意软件的有效途径。虽然网络安全机构和公司开发产品和组织课程，让人们检测恶意垃圾邮件模式，但垃圾邮件攻击还没有完全避免。在这项工作中，我们提出并公开了“垃圾邮件恶意软件检测-600”(SEMD-600)，这是一个基于Bruce Guenter的新数据集，用于仅使用电子邮件文本检测垃圾邮件中的恶意软件。我们还介绍了一种基于传统自然语言处理(NLP)技术的恶意软件检测管道。使用SEMD-600，我们比较了文本表示技术词袋和词频逆文档频率(TF-IDF)，结合三种不同的监督分类器:支持向量机，朴素贝叶斯和逻辑回归，以检测纯文本文档中的恶意软件。我们发现TF-IDF与Logistic回归相结合的效果最好，宏观F1得分为0.763。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Detecting malware using text documents extracted from spam email through machine learning

Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses for people to detect malicious spam email patterns, spam attacks are not totally avoided yet. In this work, we present and make publicly available "Spam Email Malware Detection - 600" (SEMD-600), a new dataset, based on Bruce Guenter's, for malware detection in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 22nd ACM Symposium on Document Engineering

自引率

0.00%

发文量