Determining Resampling Ratios Using BSMOTE and SVM-SMOTE for Identifying Rare Attacks in Imbalanced Cybersecurity Data

Computers (IF 2.6, Q2, Computer Science, Interdisciplinary Applications) | Pub Date: 2023-10-11 | DOI: 10.3390/computers12100204

Sikha S. Bagui, Dustin Mink, Subhash C. Bagui, Sakthivel Subramaniam

Abstract

Machine learning is widely used in cybersecurity to detect network intrusions. Although network attacks are increasing steadily, they still make up only a small fraction of actual network traffic. This makes it difficult to train machine learning models to detect malicious attacks and distinguish them from routine traffic: the ratio of attacks to benign data is very low, which produces highly imbalanced datasets. In this work, we address this issue using data resampling techniques. Several oversampling and undersampling techniques are available; this paper addresses how they can be used most effectively. Two oversampling techniques, Borderline SMOTE (BSMOTE) and SVM-SMOTE, are used to oversample the minority data, and random undersampling is used to undersample the majority data. Both oversampling techniques use k-nearest neighbors (KNN) after selecting a random minority sample point, so the impact of varying the KNN value on the performance of each oversampling technique is also analyzed. Random Forest is used to classify the rare attacks. The work is carried out on a widely used cybersecurity dataset, UNSW-NB15, and the results show that 10% oversampling gives better results for both BSMOTE and SVM-SMOTE.
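
The resampling steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. It assumes that "X% oversampling" means growing the minority class until it is X% of the majority class (the abstract does not define the percentage), and it follows the commonly described Borderline-SMOTE1 rule: a minority point is "in danger" when at least half, but not all, of its m nearest neighbours are majority points. All function names, parameter names, and the toy data are hypothetical. SVM-SMOTE and the Random Forest classifier are not shown.

```python
import math
import random


def synthetic_count(n_minority, n_majority, oversample_pct):
    """Synthetic samples needed so that minority = oversample_pct % of majority.

    Assumption: this is how the oversampling percentage is interpreted.
    """
    target = n_majority * oversample_pct // 100
    return max(0, target - n_minority)


def random_undersample(majority, keep_pct, seed=0):
    """Random undersampling: keep keep_pct % of majority points, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, len(majority) * keep_pct // 100)


def danger_indices(minority, majority, m):
    """Indices of borderline minority points.

    A point is borderline when at least half, but not all, of its m nearest
    neighbours (searched over both classes) belong to the majority class.
    """
    pool = [(x, True) for x in minority] + [(x, False) for x in majority]
    danger = []
    for i, p in enumerate(minority):
        others = pool[:i] + pool[i + 1:]  # pool[i] is p itself
        others.sort(key=lambda item: math.dist(p, item[0]))
        n_majority_neighbours = sum(1 for _, is_min in others[:m] if not is_min)
        if m / 2 <= n_majority_neighbours < m:
            danger.append(i)
    return danger


def borderline_smote(minority, majority, n_new, k=5, m=5, seed=0):
    """Create n_new synthetic minority points near the class boundary.

    Each synthetic point is interpolated between a borderline minority point
    and one of its k nearest minority neighbours (the KNN value being varied).
    """
    rng = random.Random(seed)
    danger = danger_indices(minority, majority, m)
    if not danger or len(minority) < 2:
        return []
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(danger)
        p = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from p to q
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


# Toy 2-D data (made up for illustration): a 5x5 majority grid and a small
# minority cluster whose left edge touches the grid.
majority = [(float(x), float(y)) for x in range(5) for y in range(5)]
minority = [(4.4, 1.0), (4.4, 2.0), (4.4, 3.0), (6.0, 2.0), (6.5, 2.0), (6.0, 2.5)]

n_new = synthetic_count(len(minority), len(majority), oversample_pct=40)
danger = danger_indices(minority, majority, m=5)
synthetic = borderline_smote(minority, majority, n_new, k=3, m=5, seed=42)
kept_majority = random_undersample(majority, keep_pct=50, seed=42)
```

With counts like 100 attack records against 10,000 benign ones, `synthetic_count(100, 10000, 10)` asks for 900 synthetic attack records under the interpretation assumed here. In practice, library implementations such as `BorderlineSMOTE`, `SVMSMOTE`, and `RandomUnderSampler` in imbalanced-learn would normally be used instead of hand-written code; the abstract does not say which implementation the authors used.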
