Bio-Inspired Algorithm Based Undersampling Approach and Ensemble Learning for Twitter Spam Detection

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems | IF: 1 | Zone 4 (Computer Science) | JCR Q4 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE) | Pub Date: 2024-02-20 | DOI: 10.1142/s0218488524500016

K. Kiruthika Devi, G. A. Sathish Kumar


Citations: 0


Bio-Inspired Algorithm Based Undersampling Approach and Ensemble Learning for Twitter Spam Detection

Social media networks such as Facebook and Twitter have evolved into valuable platforms for global communication. However, because of its extensive user base, Twitter is often misused by illegitimate users engaging in illicit activities. Although numerous research papers examine how to combat illegitimate users on Twitter, most share a common shortcoming: they do not address class imbalance, which significantly impacts the effectiveness of spam detection. The few works that do address class imbalance have not yet applied bio-inspired algorithms to balance the dataset. Therefore, we introduce PSOB-U, a particle swarm optimization-based undersampling technique designed to balance the Twitter dataset. In PSOB-U, various classifiers and metrics are employed to select majority samples and rank them. Furthermore, an ensemble learning approach combines the base classifiers in three stages. During the training phase of the base classifiers, undersampling techniques and a cost-sensitive random forest (CS-RF) are used to address the imbalance at both the data and algorithmic levels. In the first stage, the imbalanced dataset is balanced using random undersampling, particle swarm optimization-based undersampling, and random oversampling. In the second stage, a classifier is constructed for each of the balanced datasets obtained through these sampling techniques. In the third stage, majority voting aggregates the predicted outputs of the three classifiers. The evaluation results demonstrate that the proposed method significantly enhances the detection of illegitimate users in the imbalanced Twitter dataset. We also compare the proposed work with existing models; the results highlight the superiority of our spam detection model over state-of-the-art spam detection models that address the class imbalance problem. The combination of particle swarm optimization-based undersampling and ensemble learning with majority voting results in more accurate spam detection.
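The three stages described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract gives no details of PSOB-U's particle encoding or fitness function, so the top-k score encoding and the balanced-accuracy objective below are assumptions, and a nearest-centroid learner stands in for the cost-sensitive random forest to keep the example dependency-free. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def split(X, y):
    """Split rows into (majority, minority); label 1 is assumed to be the minority."""
    return ([x for x, t in zip(X, y) if t == 0],
            [x for x, t in zip(X, y) if t == 1])


def random_undersample(X, y, rng):
    maj, mino = split(X, y)
    kept = rng.sample(maj, len(mino))  # drop majority rows at random
    return kept + mino, [0] * len(kept) + [1] * len(mino)


def random_oversample(X, y, rng):
    maj, mino = split(X, y)
    extra = [rng.choice(mino) for _ in range(len(maj) - len(mino))]  # duplicate minority rows
    return maj + mino + extra, [0] * len(maj) + [1] * (len(mino) + len(extra))


def fit(X, y):
    """Stand-in base learner (nearest centroid), NOT the paper's CS-RF."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model, x):
    return min(model, key=lambda label: math.dist(model[label], x))


def pso_undersample(X, y, rng, particles=8, iters=15, w=0.7, c1=1.5, c2=1.5):
    """Hypothetical PSO undersampler: each particle scores every majority row,
    and the top-k scores (k = minority size) decide which rows are kept."""
    maj, mino = split(X, y)
    n, k = len(maj), len(mino)

    def decode(pos):
        top = sorted(range(n), key=lambda i: pos[i], reverse=True)[:k]
        return [maj[i] for i in top]

    def fitness(pos):
        # Balanced accuracy on the full training set (an assumed objective).
        model = fit(decode(pos) + mino, [0] * k + [1] * k)
        recall_maj = sum(predict(model, x) == 0 for x in maj) / n
        recall_min = sum(predict(model, x) == 1 for x in mino) / k
        return (recall_maj + recall_min) / 2

    pos = [[rng.random() for _ in range(n)] for _ in range(particles)]
    vel = [[0.0] * n for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(n):
                # Standard velocity update: inertia + cognitive pull + social pull.
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
            if f > gbest_fit:
                gbest, gbest_fit = pos[i][:], f
    return decode(gbest) + mino, [0] * k + [1] * k


def train_ensemble(X, y, seed=0):
    rng = random.Random(seed)
    samplers = (random_undersample, pso_undersample, random_oversample)  # stage 1
    return [fit(*sampler(X, y, rng)) for sampler in samplers]  # stage 2


def majority_vote(models, x):
    votes = Counter(predict(m, x) for m in models)  # stage 3
    return votes.most_common(1)[0][0]


def make_toy_data(seed=0, n_major=60, n_minor=10):
    """Synthetic imbalanced 2-D data; it has nothing to do with the Twitter dataset."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_major)]
    X += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(n_minor)]
    return X, [0] * n_major + [1] * n_minor


if __name__ == "__main__":
    X, y = make_toy_data()
    models = train_ensemble(X, y)
    print(majority_vote(models, [3.0, 3.0]), majority_vote(models, [0.0, 0.0]))
```

With three binary classifiers a vote can never tie, which is one practical reason to aggregate an odd number of base models. In a real pipeline the stand-in learner would presumably be replaced by a cost-sensitive random forest, addressing imbalance at the algorithmic level in addition to the data-level resampling shown here.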


Source journal

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore: 2.70

Self-citation rate: 0.00%

Review time: 13.5 months

About the journal: The International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems is a forum for research on various methodologies for the management of imprecise, vague, uncertain or incomplete information. The aim of the journal is to promote theoretical or methodological works dealing with all kinds of methods to represent and manipulate imperfectly described pieces of knowledge, excluding results on pure mathematics or simple applications of existing theoretical results. It is published bimonthly, with worldwide distribution to researchers, engineers, decision-makers, and educators.