Spam Filtering with KNN: Investigation of the Effect of k Value on Classification Performance

Durmuş Özkan Şahin, Sercan Demirci
2020 28th Signal Processing and Communications Applications Conference (SIU), published 2020-10-05
DOI: 10.1109/SIU49456.2020.9302516 (https://doi.org/10.1109/SIU49456.2020.9302516)
Citations: 6

Abstract

This study filters spam e-mails using machine learning and text mining techniques. The K-Nearest Neighbor (KNN) algorithm, one of the techniques of machine learning, is used. KNN is an easy-to-use, high-performance classification algorithm, but its main difficulty is that the value of k must be chosen in advance, and the performance of the algorithm varies with the selected k. Three data sets are examined: Enron, Ling-Spam and SMSSpam-Collection. First, basic text mining techniques and the term frequency–inverse document frequency (TF-IDF) term weighting method are applied to all data sets. Then, the best 500 attributes are selected with the Chi-Square feature selection method and given to the KNN algorithm. Finally, extensive experiments are carried out with k set to 1, 3, 5, 7 and 9. On all three data sets, the best result is obtained when k is 1. The best F-measure values obtained on the Ling-Spam, Enron and SMSSpam-Collection data sets are 0.9324, 0.9215 and 0.9196, respectively.
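The pipeline described in the abstract (term weighting, Chi-Square selection, KNN over several k values, F-measure scoring) can be illustrated with a minimal from-scratch sketch. This is not the authors' implementation. Several details are assumptions, because the abstract does not state them: Chi-Square is computed from a 2x2 term-presence table, neighbors are ranked by cosine similarity, IDF is smoothed, spam is treated as the positive class, and the six-message corpus is invented purely for illustration (so only 9 terms are kept instead of 500, and k stops at 5).

```python
import math
from collections import Counter

def chi_square(docs, labels, term):
    """Chi-square score of a term from its 2x2 presence/class table (assumed form)."""
    a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)  # spam, has term
    b = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)  # ham, has term
    c = sum(labels) - a                # spam, lacks term
    d = len(labels) - sum(labels) - b  # ham, lacks term
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return len(labels) * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(docs, labels, top_n):
    """Keep the top_n terms by chi-square score; ties are broken alphabetically."""
    vocab = sorted({t for doc in docs for t in doc})
    ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t), reverse=True)
    return set(ranked[:top_n])

def fit_idf(docs):
    """Smoothed inverse document frequency learned from the training documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log((1 + n) / (1 + count)) + 1.0 for t, count in df.items()}

def tfidf_vector(doc, idf, selected):
    """Sparse TF-IDF vector restricted to the selected terms."""
    tf = Counter(t for t in doc if t in selected)
    return {t: count * idf[t] for t, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(x, train_vecs, train_labels, k):
    """Majority vote among the k training vectors most similar to x."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(x, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy corpus (1 = spam, 0 = ham); not taken from the paper's data sets.
train_texts = ["win free cash now", "free prize claim now", "cheap cash prize win",
               "meeting at noon tomorrow", "lunch meeting tomorrow",
               "project report at noon"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["claim free cash", "report for meeting tomorrow"]
test_labels = [1, 0]

train_docs = [text.split() for text in train_texts]
test_docs = [text.split() for text in test_texts]

selected = select_terms(train_docs, train_labels, top_n=9)  # the paper keeps 500
idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(doc, idf, selected) for doc in train_docs]
test_vecs = [tfidf_vector(doc, idf, selected) for doc in test_docs]

scores = {}
for k in (1, 3, 5):  # the paper also tries 7 and 9; this corpus is too small
    preds = [knn_predict(x, train_vecs, train_labels, k) for x in test_vecs]
    scores[k] = f_measure(test_labels, preds)
    print(f"k={k}: F-measure={scores[k]:.4f}")
```

The numbers printed for such a tiny corpus carry no weight. The sketch only shows where k enters the procedure and how the F-measure is derived from the predictions at each k.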