Development of Various Stacking Ensemble-Based HIDS Using ADFA Datasets

IF 6.3 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC IEEE Open Journal of the Communications Society Pub Date : 2025-02-03 DOI:10.1109/OJCOMS.2025.3538101

Hami Satilmiş;Sedat Akleylek;Zaliha Yüce Tok

{"title":"Development of Various Stacking Ensemble-Based HIDS Using ADFA Datasets","authors":"Hami Satilmiş;Sedat Akleylek;Zaliha Yüce Tok","doi":"10.1109/OJCOMS.2025.3538101","DOIUrl":null,"url":null,"abstract":"The rapid increase in the number of cyber attacks and the emergence of various attack variations pose significant threats to the security of computer systems and networks. Various intrusion detection systems (IDS) are developed to defend computer systems and networks in response to these threats. One type of IDS, known as a host-based intrusion detection system (HIDS), focuses on securing a single host. Numerous HIDS have been proposed in the literature, incorporating various detection methods. This study develops multiple machine learning (ML) models and stacking ensemble based HIDS that can be used as detection methods in HIDS. Initially, n-grams, standard bag-of-words (BoW), binary BoW, probability BoW, and term frequency-inverse document frequency (TF-IDF) BoW methods are applied to the ADFA-LD and ADFA-WD datasets. Mutual information and k-means methods are used together for feature selection on the resulting BoW datasets. Individual models are created using either selected features or all features. Subsequently, the outputs of these individual models are used in extreme gradient boosting (XGBoost) and adaptive boosting (AdaBoost) models to develop stacking ensemble based models. The experimental results show that the best accuracy (ACC) among models using ADFA-LD based BoW datasets is achieved by the stacking ensemble based XGBoost model, which has an ACC of 0.9747. This XGBoost model utilizes the standard BoW dataset and selected features. Among models using ADFA-WD based BoW datasets, the stacking ensemble based XGBoost is also the most successful in terms of ACC, with an ACC of 0.9163, using the standard BoW dataset and all features.","PeriodicalId":33803,"journal":{"name":"IEEE Open Journal of the Communications Society","volume":"6 ","pages":"1170-1189"},"PeriodicalIF":6.3000,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10870100","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Communications Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10870100/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

Abstract

The rapid increase in the number of cyber attacks and the emergence of various attack variations pose significant threats to the security of computer systems and networks. Various intrusion detection systems (IDS) are developed to defend computer systems and networks in response to these threats. One type of IDS, known as a host-based intrusion detection system (HIDS), focuses on securing a single host. Numerous HIDS have been proposed in the literature, incorporating various detection methods. This study develops multiple machine learning (ML) models and stacking ensemble based HIDS that can be used as detection methods in HIDS. Initially, n-grams, standard bag-of-words (BoW), binary BoW, probability BoW, and term frequency-inverse document frequency (TF-IDF) BoW methods are applied to the ADFA-LD and ADFA-WD datasets. Mutual information and k-means methods are used together for feature selection on the resulting BoW datasets. Individual models are created using either selected features or all features. Subsequently, the outputs of these individual models are used in extreme gradient boosting (XGBoost) and adaptive boosting (AdaBoost) models to develop stacking ensemble based models. The experimental results show that the best accuracy (ACC) among models using ADFA-LD based BoW datasets is achieved by the stacking ensemble based XGBoost model, which has an ACC of 0.9747. This XGBoost model utilizes the standard BoW dataset and selected features. Among models using ADFA-WD based BoW datasets, the stacking ensemble based XGBoost is also the most successful in terms of ACC, with an ACC of 0.9163, using the standard BoW dataset and all features.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于ADFA数据集的各种堆叠集成HIDS的开发

网络攻击数量的迅速增加和各种攻击形式的出现对计算机系统和网络的安全构成了重大威胁。人们开发了各种入侵检测系统（IDS）来保护计算机系统和网络以应对这些威胁。一种IDS称为基于主机的入侵检测系统（HIDS），其重点是保护单个主机。文献中提出了许多HIDS，包括各种检测方法。本研究开发了多个机器学习（ML）模型和基于堆叠集成的HIDS，可作为HIDS中的检测方法。首先，将n-grams、标准词袋（BoW）、二进制BoW、概率BoW和术语频率-逆文档频率（TF-IDF） BoW方法应用于ADFA-LD和ADFA-WD数据集。互信息和k-means方法一起用于生成的BoW数据集的特征选择。使用选定的特征或所有特征创建单个模型。随后，将这些单独模型的输出用于极端梯度增强（XGBoost）和自适应增强（AdaBoost）模型，以开发基于叠加集成的模型。实验结果表明，基于叠加集成的XGBoost模型在基于ADFA-LD的BoW数据集模型中获得了最好的精度（ACC），其ACC为0.9747。这个XGBoost模型利用标准的BoW数据集和选定的特征。在使用基于ADFA-WD的BoW数据集的模型中，基于堆叠集成的XGBoost在ACC方面也是最成功的，在使用标准BoW数据集和所有特征的情况下，其ACC为0.9163。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Open Journal of the Communications Society Multiple-

CiteScore

13.70

自引率

3.80%

发文量

审稿时长

10 weeks

期刊介绍： The IEEE Open Journal of the Communications Society (OJ-COMS) is an open access, all-electronic journal that publishes original high-quality manuscripts on advances in the state of the art of telecommunications systems and networks. The papers in IEEE OJ-COMS are included in Scopus. Submissions reporting new theoretical findings (including novel methods, concepts, and studies) and practical contributions (including experiments and development of prototypes) are welcome. Additionally, survey and tutorial articles are considered. The IEEE OJCOMS received its debut impact factor of 7.9 according to the Journal Citation Reports (JCR) 2023. The IEEE Open Journal of the Communications Society covers science, technology, applications and standards for information organization, collection and transfer using electronic, optical and wireless channels and networks. Some specific areas covered include: Systems and network architecture, control and management Protocols, software, and middleware Quality of service, reliability, and security Modulation, detection, coding, and signaling Switching and routing Mobile and portable communications Terminals and other end-user devices Networks for content distribution and distributed computing Communications-based distributed resources control.