Relevance feature selection with data cleaning for intrusion detection system

S. Suthaharan, T. Panchagnula
2012 Proceedings of IEEE Southeastcon. Published 2012-03-15. DOI: 10.1109/SECON.2012.6196965 (https://doi.org/10.1109/SECON.2012.6196965)
Citations: 32

Abstract

Labeled datasets play a major role in validating and evaluating machine learning techniques for intrusion detection systems. To obtain good accuracy in the evaluation, very large datasets should be considered. Intrusion traffic and normal traffic generally depend on a large number of network characteristics called features. However, not all of these features contribute to the traffic characteristics. Eliminating the non-contributing features from the datasets, to improve the speed and accuracy of evaluating machine learning techniques, therefore becomes an important requirement. In this paper we suggest an approach that analyzes the intrusion datasets, evaluates each feature for its relevance to a specific attack, determines the feature's level of contribution, and automatically eliminates non-contributing features from the dataset. We adopt a Rough Set Theory (RST) based approach and select relevant features automatically using multidimensional scatter plots. A pair-wise feature selection process is adopted to simplify the selection. In our previous research we used the KDD'99 dataset to validate the RST-based approach. KDD'99 contains many redundant data entries, so machine learning techniques are biased towards the most frequently occurring events. This bias leads the algorithms to ignore less frequent events, which can be more harmful than the most frequent ones. False positives are another important drawback of the KDD'99 dataset. In this paper, we adopt the NSL-KDD dataset (an improved version of the KDD'99 dataset) and validate the automated RST-based approach. The approach presented in this paper leads to a selection of the most relevant features, and we expect that intrusion detection research using KDD'99-based datasets will benefit from a good understanding of network features and their influence on attacks.
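The two ideas in the abstract, removing redundant records and scoring features by their relevance to an attack label, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper selects features with multidimensional scatter plots, whereas the sketch below swaps in the standard rough-set dependency degree (the fraction of records whose feature values determine the label unambiguously). The toy records, the feature order (protocol, flag, service), and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def drop_duplicates(rows):
    """Remove exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def dependency_degree(rows, feature_idx, label_idx):
    """Rough-set dependency degree of the label on the chosen features.

    Records are grouped into equivalence classes by their values on
    feature_idx. A class is consistent when all its records share one
    label. The result is the fraction of records in consistent classes.
    """
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[i] for i in feature_idx)
        classes[key].append(row[label_idx])
    consistent = sum(
        len(labels) for labels in classes.values() if len(set(labels)) == 1
    )
    return consistent / len(rows)


def rank_feature_pairs(rows, n_features, label_idx):
    """Score every pair of features and sort from most to least relevant."""
    scores = {
        pair: dependency_degree(rows, pair, label_idx)
        for pair in combinations(range(n_features), 2)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical, already-discretized records: (protocol, flag, service, label).
rows = [
    ("tcp", "SF", "http", "normal"),
    ("tcp", "SF", "http", "normal"),   # redundant entry
    ("tcp", "S0", "http", "attack"),
    ("tcp", "S0", "ftp", "attack"),
    ("udp", "SF", "dns", "normal"),
    ("tcp", "SF", "ftp", "normal"),
    ("tcp", "S0", "http", "attack"),   # redundant entry
]

clean = drop_duplicates(rows)
ranking = rank_feature_pairs(clean, n_features=3, label_idx=3)

if __name__ == "__main__":
    print(len(rows), "records before cleaning,", len(clean), "after")
    for pair, score in ranking:
        print(pair, score)
```

On this toy data the flag feature alone separates normal from attack records, so every pair containing it scores highest, while the protocol and service pair scores low and would be a candidate for elimination. Real NSL-KDD features are largely numeric and would need discretization before a grouping like this is meaningful.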