MIM: A multiple integration model for intrusion detection on imbalanced samples

World Wide Web, Pub Date: 2024-07-10, DOI: 10.1007/s11280-024-01285-0

Zhiqiang Zhang, Le Wang, Junyi Zhu, Dong Zhu, Zhaoquan Gu, Yanchun Zhang


Citations: 0

Abstract

In network security data, normal samples typically far outnumber malicious samples, which makes the data imbalanced. On imbalanced samples, a classification model needs careful sampling and attribute selection to counter its bias towards the majority classes. Simple data sampling methods and incomplete feature selection techniques cannot improve the accuracy of intrusion detection models. In addition, a single intrusion detection model cannot accurately classify all attack types in the face of massive imbalanced security data. Moreover, existing model integration methods based on stacking or voting suffer from high coupling, which undermines their stability and reliability. To address these issues, we propose a Multiple Integration Model (MIM) to implement feature selection and attack classification. First, MIM uses random Oversampling, random Undersampling and Washing Methods (OUWM) to reconstruct the data. Then, a modified simulated annealing algorithm is employed to generate candidate features. Finally, an integrated model based on Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost) and gradient Boosting with Categorical features support (CatBoost) is designed to achieve intrusion detection and attack classification. MIM leverages a Rule-based and Priority-based Ensemble Strategy (RPES) to combine the high accuracy of the first model with the high effectiveness of the other two, improving the stability and reliability of the integration model. We evaluate the effectiveness of our approach on two publicly available intrusion detection datasets, as well as a dataset created by researchers from the University of New Brunswick and another dataset collected by the Australian Center for Cyber Security. In our experiments, MIM significantly outperforms several existing intrusion detection models in terms of accuracy. Specifically, compared with two recently proposed methods, the reinforcement learning method based on the adaptive sample distribution dual-experience replay pool mechanism (ASD2ER) and the method that combines Auto Encoder, Principal Component Analysis, and Long Short-Term Memory (AE+PCA+LSTM), MIM improves intrusion detection accuracy by 1.35% and 1.16%, respectively.
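The abstract does not spell out how OUWM or RPES work internally, so the following is only a minimal sketch of the two general ideas it names: random over/undersampling to a common class size, and a rule-plus-priority combination of three classifiers' predictions. The function names `rebalance` and `priority_combine`, the `target` parameter, and the specific override rule are assumptions made for illustration and are not taken from the paper. The washing (cleaning) step and the simulated annealing feature search are omitted.

```python
import numpy as np


def rebalance(X, y, target, rng):
    """Randomly over- or undersample every class to `target` rows (hypothetical helper)."""
    parts_X, parts_y = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Draw with replacement only when the class has fewer than `target` rows
        # (oversampling); otherwise draw without replacement (undersampling).
        chosen = rng.choice(idx, size=target, replace=len(idx) < target)
        parts_X.append(X[chosen])
        parts_y.append(y[chosen])
    return np.concatenate(parts_X), np.concatenate(parts_y)


def priority_combine(preds):
    """Combine label arrays from several models, listed in descending priority.

    Assumed rule (not the paper's): keep the highest-priority model's label
    unless all lower-priority models agree with each other, in which case
    their shared label is used instead.
    """
    preds = np.asarray(preds)            # shape: (n_models, n_samples)
    top, rest = preds[0], preds[1:]
    unanimous = np.all(rest == rest[0], axis=0)
    return np.where(unanimous, rest[0], top)
```

In a fuller pipeline, the rows of `preds` would be the outputs of fitted LightGBM, XGBoost and CatBoost classifiers; they are plain arrays here so the sketch depends only on NumPy.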


