Comparison of Machine Learning Algorithms for Spam Detection

Journal of Advances in Information Technology (IF 0.9, JCR Q4, Computer Science, Information Systems). Pub Date: 2023-01-01. DOI: 10.12720/jait.14.2.178-184

Azeema Sadia, Fatima Bashir, Reema Qaiser Khan, Ammarah Khalid



Abstract

The Internet is a tool that gives people access to seemingly endless knowledge. It is a global platform used for connectivity, communication, and sharing. At almost no cost, an individual can use the Internet to send email messages, tweets, and Facebook messages to a vast number of people. These messages can also contain unsolicited advertisements, which are identified as spam. Twitter, too, is heavily affected by spamming, and it is an alarming issue for the company. Twitter considers spam to be actions that are unsolicited and repeated. These include tweet repetition and URLs that lead users to completely unrelated websites. The authors worked with a Twitter dataset focusing on tweets about "iPhone". The data was collected using an API and then pre-processed. In this paper, content-based features that identify spam tweets were selected using R. Multiple machine learning algorithms were applied to detect spam tweets: Naive Bayes, Logistic Regression, k-nearest neighbors (KNN), Decision Tree, and Support Vector Machine. The best performance was achieved by the Naive Bayes algorithm, with an accuracy of 89%.
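The pipeline the abstract describes (content-based features, a classifier, an accuracy score) can be illustrated with a minimal sketch. It is not the authors' implementation: the paper used R, while this example uses only the Python standard library. The three features (`has_url`, `is_repeat`, `many_hashtags`), the hashtag threshold, the class `BernoulliNaiveBayes`, and the four toy tweets are all assumptions made for illustration, loosely based on the abstract's mention of repeated tweets and unrelated URLs. It does not reproduce the reported 89% accuracy.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")


def extract_features(tweet, seen_texts):
    """Map a tweet to binary content-based features (illustrative choices)."""
    # Compare tweets with URLs stripped, so repeats with new links still match.
    normalized = URL_RE.sub("", tweet).lower().strip()
    features = {
        "has_url": bool(URL_RE.search(tweet)),
        "is_repeat": normalized in seen_texts,
        "many_hashtags": tweet.count("#") >= 3,  # threshold is an assumption
    }
    seen_texts.add(normalized)
    return features


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.names = sorted(rows[0])
        counts = Counter(labels)
        self.log_prior = {c: math.log(counts[c] / len(labels)) for c in self.classes}
        # Estimate P(feature is True | class) for every feature and class.
        self.p_true = {}
        for c in self.classes:
            class_rows = [r for r, y in zip(rows, labels) if y == c]
            self.p_true[c] = {
                n: (sum(r[n] for r in class_rows) + 1) / (len(class_rows) + 2)
                for n in self.names
            }
        return self

    def predict(self, row):
        def log_score(c):
            total = self.log_prior[c]
            for n in self.names:
                p = self.p_true[c][n]
                total += math.log(p if row[n] else 1.0 - p)
            return total

        return max(self.classes, key=log_score)


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(model.predict(r) == y for r, y in zip(rows, labels))
    return correct / len(labels)


# Invented toy data, not taken from the paper's dataset.
tweets = [
    ("Win a free iPhone now http://example.com/a #free #win #iphone", "spam"),
    ("Win a free iPhone now http://example.com/b #free #win #iphone", "spam"),
    ("My iPhone battery lasts all day after the update", "ham"),
    ("Thinking about switching to an iPhone next year", "ham"),
]
seen = set()
rows = [extract_features(text, seen) for text, _ in tweets]
labels = [label for _, label in tweets]

model = BernoulliNaiveBayes().fit(rows, labels)
print("training accuracy:", accuracy(model, rows, labels))
```

Accuracy here is measured on the same four tweets used for fitting, so it says nothing about generalization; a real comparison of the five algorithms would score each on a held-out test split.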




Source journal

Journal of Advances in Information Technology (Computer Science, Information Systems)

CiteScore

4.20

Self-citation rate

20.00%

Publication count