Extending limited datasets with GAN-like self-supervision for SMS spam detection

Impact Factor: 5.4 | CAS Zone 2 (Computer Science) | JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS) | Computers & Security | Pub Date: 2024-10-01 | Epub Date: 2024-07-14 | DOI: 10.1016/j.cose.2024.103998

Or Haim Anidjar, Revital Marbel, Ran Dubin, Amit Dvir, Chen Hajaj

Computers & Security, vol. 145, Article 103998 (2024). Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S0167404824003031

Citation count: 0

Abstract

Short Message Service (SMS) spam is a harmful form of phishing that targets mobile phones. Fraudsters try to misuse users' personal information through deceptive text messages, sometimes accompanied by a fake URL that asks for personal details such as passwords and usernames. In machine learning, several approaches have tried to address this problem, but the scarcity of available data has commonly been the main obstacle to a good enough solution. Therefore, in this paper, we suggest a dataset extension technique for small datasets, based on an Out-Of-Distribution (OOD) metric. Approaches such as Generative Adversarial Networks (GANs) have been suggested for this purpose, yet GANs are hard to train when the dataset has few samples. We present a GAN-like method that imitates the generator concept of GANs to extend limited datasets, using the OOD concept. Using a sophisticated text generation method, we show how to apply it to datasets from the domain of fraud and spam detection in SMS messages, and achieve over 25% relative improvement compared to two other solutions. In addition, because of the class imbalance in typical spam datasets, our approach is examined on another dataset to verify that the false alarm rate is low enough.
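
The abstract does not spell out the generator or the OOD metric, so the following is only a minimal sketch of the general idea of OOD-gated dataset extension, not the authors' method. It assumes candidate messages already come from some text generator, and it uses a stand-in OOD score (one minus the highest bag-of-words cosine similarity to any seed message). The names `ood_score`, `extend_dataset`, and the `threshold` value are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts for one message."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ood_score(candidate, seed_texts):
    """Stand-in OOD score (an assumption, not the paper's metric):
    1 - max cosine similarity between the candidate and any seed message."""
    cand_vec = bow(candidate)
    return 1.0 - max(cosine(cand_vec, bow(s)) for s in seed_texts)


def extend_dataset(seed_texts, candidates, threshold=0.6, eps=1e-9):
    """Return the seeds plus every candidate that is novel (score > eps)
    yet still close to the seed distribution (score <= threshold)."""
    accepted = []
    for cand in candidates:
        score = ood_score(cand, seed_texts)
        if eps < score <= threshold:
            accepted.append(cand)
    return list(seed_texts) + accepted


seeds = [
    "win a free prize now click here",
    "your account is locked verify your password",
]
generated = [
    "win a free prize today click here",     # close variant of a seed
    "the weather is nice in the mountains",  # far from both seeds
    "win a free prize now click here",       # exact copy of a seed
]
extended = extend_dataset(seeds, generated)
print(len(extended), extended[-1])
```

In this toy setup the gate plays the role a discriminator would play in a GAN: it decides which generated samples are plausible enough to join the training set. A real pipeline would likely replace the bag-of-words score with one computed from a learned representation.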


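Source journal

Computers & Security (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 12.40

Self-citation rate: 7.10%

Articles published: 365

Review time: 10.7 months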

Journal description: Computers & Security is the most respected technical journal in the IT security field. With its high-profile editorial board and informative regular features and columns, the journal is essential reading for IT security professionals around the world. Computers & Security provides you with a unique blend of leading-edge research and sound practical management advice. It is aimed at the professional involved with computer security, audit, control and data integrity in all sectors: industry, commerce and academia. Recognized worldwide as THE primary source of reference for applied research and technical expertise, it is your first step to fully secure systems.