Clustering as feature selection method in spam classification: uncovering sick-leave sellers

Applied Computing and Informatics | IF 12.3 | JCR Q1 (Computer Science, Information Systems) | Pub Date: 2021-12-14 | DOI: 10.1108/aci-09-2021-0248
M. Elhussein, Samiha Brahimi
Citations: 0

Abstract

Purpose: This paper proposes a novel way of using textual clustering as a feature selection method. The method is applied to identify the most important keywords for profile classification and is demonstrated on the problem of sick-leave promoters on Twitter.

Design/methodology/approach: Four machine learning classifiers were applied to a total of 35,578 tweets posted on Twitter. The data were manually labeled into two categories: promoter and nonpromoter. Classification performance was compared between the proposed clustering-based feature selection approach and standard feature selection.

Findings: Random forest achieved the highest accuracy, 95.91%, which is higher than that of the similar work it was compared with. Furthermore, using clustering as a feature selection method improved the sensitivity of the model from 73.83% to 98.79%. Sensitivity (recall) is the most important measure of classifier performance when detecting promoters' accounts that have spam-like behavior.

Research limitations/implications: The method applied is novel, so more testing on other datasets is needed before its results can be generalized.

Practical implications: The model can be used by Saudi authorities to report accounts that sell sick leaves online.

Originality/value: The research proposes a new way in which textual clustering can be used in feature selection.
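
The abstract does not say which clustering algorithm was used or how keywords were drawn from the clusters, so the following is only a minimal sketch of one plausible reading: cluster the tweets, then keep the highest-weighted terms of each cluster centroid as the selected features. The toy tweets, the helper names (`vectorize`, `kmeans`, `top_terms`, `sensitivity`), the choice of plain k-means on raw word counts, and the fixed seed rows are all assumptions made for illustration, not the authors' pipeline. The last function computes sensitivity (recall), the measure the abstract singles out.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real tweets need more cleaning)."""
    return text.lower().split()

def vectorize(docs):
    """Bag-of-words counts over a shared, alphabetically sorted vocabulary."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    index = {term: i for i, term in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for tok in tokenize(d):
            row[index[tok]] += 1.0
        rows.append(row)
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, seeds, iters=10):
    """Plain Lloyd's k-means; `seeds` are row indices used as initial centroids."""
    centroids = [list(rows[s]) for s in seeds]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: each row goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(r, centroids[c]))
            for r in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def top_terms(vocab, centroid, n):
    """The n terms with the largest centroid weight; ties break alphabetically."""
    ranked = sorted(range(len(vocab)), key=lambda i: (-centroid[i], vocab[i]))
    return [vocab[i] for i in ranked[:n]]

def sensitivity(y_true, y_pred):
    """Recall of the positive (promoter = 1) class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical toy tweets: the first three mimic promoters, the rest do not.
docs = [
    "sick leave for sale contact now",
    "buy sick leave report cheap",
    "sick leave report for sale",
    "lovely weather in riyadh today",
    "watching football with friends today",
    "coffee with friends in riyadh",
]

vocab, rows = vectorize(docs)
labels, centroids = kmeans(rows, seeds=[0, 3])

# Selected features: the union of the top-2 terms of every cluster.
selected = sorted({t for c in centroids for t in top_terms(vocab, c, 2)})
print(labels)
print(selected)

# Sensitivity on hypothetical predictions: 2 of 3 promoters are caught.
print(sensitivity([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]))
```

On real data one would likely weight terms with TF-IDF and remove stop words before clustering, so that generic words do not dominate the centroids. The reduced vocabulary in `selected` would then define the feature columns passed to each classifier, and sensitivity would be compared with and without the selection step.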
Source journal
Applied Computing and Informatics (Computer Science: Information Systems)
CiteScore: 12.20
Self-citation rate: 0.00%
Articles published: 0
Review time: 39 weeks
Journal description: Applied Computing and Informatics aims to be timely in disseminating leading-edge knowledge to researchers, practitioners and academics whose interest is in the latest developments in applied computing and information systems concepts, strategies, practices, tools and technologies. In particular, the journal encourages research studies that have significant contributions to make to the continuous development and improvement of IT practices in the Kingdom of Saudi Arabia and other countries. By doing so, the journal attempts to bridge the gap between the academic and industrial communities and therefore welcomes theoretically grounded, methodologically sound research studies that address various IT-related problems and innovations of an applied nature. The journal will serve as a forum for practitioners, researchers, managers and IT policy makers to share their knowledge and experience in the design, development, implementation, management and evaluation of various IT applications. Contributions may deal with, but are not limited to:
• Internet and E-Commerce Architecture, Infrastructure, Models, Deployment Strategies and Methodologies.
• E-Business and E-Government Adoption.
• Mobile Commerce and its Applications.
• Applied Telecommunication Networks.
• Software Engineering Approaches, Methodologies, Techniques, and Tools.
• Applied Data Mining and Warehousing.
• Information Strategic Planning and Resource Management.
• Applied Wireless Computing.
• Enterprise Resource Planning Systems.
• IT Education.
• Societal, Cultural, and Ethical Issues of IT.
• Policy, Legal and Global Issues of IT.
• Enterprise Database Technology.