An Online Malicious Spam Email Detection System Using Resource Allocating Network with Locality Sensitive Hashing

Siti-Hajar-Aminah Ali, S. Ozawa, J. Nakazato, Tao Ban, Jumpei Shimamura
Journal of Intelligent Learning Systems and Applications (Journal Article). Published 2015-04-15. DOI: 10.4236/JILSA.2015.72005 (https://doi.org/10.4236/JILSA.2015.72005)
Citations: 10

Abstract

In this paper, we propose a new online system that quickly detects malicious spam emails and, through daily updates, adapts to changes in email contents and in the Uniform Resource Locator (URL) links that lead to malicious websites. We introduce an autonomous function by which a server generates training examples: double-bounce emails are collected automatically, and their class labels are assigned by SPIKE, a crawler-type software that analyzes website maliciousness. Because spammers generally use botnets to spread numerous malicious emails within a short time, such distributed spam emails often have the same or similar contents, so it is not necessary to learn all spam emails. To adapt quickly to new malicious campaigns, only new types of spam emails should be selected for learning, which can be realized by introducing an active learning scheme into the classifier model. For this purpose, we adopt Resource Allocating Network with Locality Sensitive Hashing (RAN-LSH) as a classifier model with a data selection function. In RAN-LSH, spam emails that are identical or similar to those already learned are found quickly by looking them up in a Locality Sensitive Hashing (LSH) hash table, and matched similar emails categorized as “well-learned” are discarded without being used as training data. To analyze email contents, we adopt the Bag of Words (BoW) approach and generate feature vectors whose attributes are transformed using the normalized term frequency-inverse document frequency (TF-IDF). To evaluate the performance of the proposed system, we use a data set of double-bounce spam emails collected at the National Institute of Information and Communications Technology (NICT) in Japan from March 1st, 2013 until May 10th, 2013. The results confirm that the proposed spam email detection system achieves a high detection rate.
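The data selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the RAN classifier itself is omitted, the smoothed IDF formula, the random-hyperplane hash family, the name `LshSelector`, the parameters `n_bits` and `seed`, and the rule that a bucket counts as "well-learned" after a single training example are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors (term -> weight) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: an assumption, the abstract does not give the formula.
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


class LshSelector:
    """Hypothetical data selector using random-hyperplane LSH."""

    def __init__(self, vocabulary, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random Gaussian hyperplane per hash bit, stored per term.
        self.planes = [{t: rng.gauss(0.0, 1.0) for t in vocabulary}
                       for _ in range(n_bits)]
        # Hash keys of buckets treated as "well-learned" (assumed criterion:
        # one training example has already been taken from the bucket).
        self.learned = set()

    def key(self, vec):
        """Hash key: the sign of the projection onto each hyperplane."""
        return tuple(
            sum(w * plane.get(t, 0.0) for t, w in vec.items()) >= 0.0
            for plane in self.planes
        )

    def select(self, vec):
        """Return True if vec should be used for training, False if discarded."""
        k = self.key(vec)
        if k in self.learned:
            return False
        self.learned.add(k)
        return True


# Toy usage: the second email is an exact duplicate of the first.
emails = [
    "cheap pills click here".split(),
    "cheap pills click here".split(),
    "meeting agenda for monday".split(),
]
vectors = tfidf_vectors(emails)
vocabulary = sorted({term for doc in emails for term in doc})
selector = LshSelector(vocabulary)
selected = [selector.select(v) for v in vectors]
```

In this toy run the duplicate email hashes to the same key as the first and is dropped, so only the first copy would reach the classifier. Near-duplicates are dropped only when they happen to share a bucket, which depends on `n_bits`: fewer bits give coarser buckets and more aggressive discarding.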