An Unsupervised Feature Selection Method for Data-Driven Anomaly Detection Systems

N. Almusallam
2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE)
Published: 2020-09-01
DOI: 10.1109/WETICE49692.2020.00016 (https://doi.org/10.1109/WETICE49692.2020.00016)

Abstract

Feature selection is widely used as a pre-processing step to improve the performance of data-driven intrusion/anomaly detection systems. For example, when data are grouped into normal and outlier groups, redundant and non-representative features reduce the accuracy of classifying the data points and increase the processing time. Feature selection is therefore applied before anomaly detection to optimise both classification accuracy and running time. Most existing feature selection methods have limitations when dealing with high-dimensional data, because they search over different subsets of features to find one that accurately represents the full feature set. Searching over combinations of features is computationally very expensive, which makes existing methods inefficient for high-dimensional data. This work presents a similarity-based unsupervised feature selection method for efficient and accurate anomaly detection (UFSAD), which selects a reduced set of representative features from high-dimensional data without using class labels. By eliminating redundant and non-representative features, the selected features should improve the accuracy and performance of anomaly detection systems. UFSAD extends the k-means clustering algorithm to partition the features into k clusters based on a similarity measure (e.g. PCC, the Pearson Correlation Coefficient; LSRE, the Least Square Regression Error; or MICI, the Maximal Information Compression Index). A centroid-based selection step then keeps, from each cluster, the feature most similar to the cluster centroid as its representative and discards the others. Extensive experiments show that UFSAD generates a reduced, representative and non-redundant feature set that achieves good classification accuracy in comparison with well-known unsupervised feature selection methods.
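
The two steps described in the abstract (clustering the features with a k-means-style loop, then keeping the feature closest to each centroid) can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not give the centroid update rule, the initialisation, or the exact form of the similarity measures, so the choices below are assumptions. Specifically, the dissimilarity is taken as 1 - |PCC|, centroids are the element-wise mean of standardized member features, initialisation is farthest-first, and all function names (`pearson`, `dissimilarity`, `standardize`, `select_features`) are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient (PCC) of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:  # constant vectors: treat as uncorrelated
        return 0.0
    return cov / math.sqrt(va * vb)


def dissimilarity(a, b):
    """Assumed measure: 1 - |PCC| (near 0 when strongly correlated, 1 when uncorrelated)."""
    return 1.0 - abs(pearson(a, b))


def standardize(col):
    """Rescale a feature to zero mean and unit variance (constant features become zeros)."""
    n = len(col)
    mean = sum(col) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
    return [(x - mean) / std if std > 0 else 0.0 for x in col]


def select_features(columns, k, measure=dissimilarity, max_iter=50):
    """Cluster features (one list per feature) into k groups.

    Returns (sorted indices of the selected representatives, cluster label per feature).
    """
    feats = [standardize(c) for c in columns]
    # Farthest-first initialisation: an assumption that keeps the sketch deterministic.
    centroids = [list(feats[0])]
    while len(centroids) < k:
        far = max(range(len(feats)),
                  key=lambda i: min(measure(feats[i], c) for c in centroids))
        centroids.append(list(feats[far]))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each feature joins the centroid it is most similar to.
        new_labels = [min(range(k), key=lambda j: measure(f, centroids[j]))
                      for f in feats]
        if new_labels == labels:  # converged: no feature changed cluster
            break
        labels = new_labels
        # Update step (assumed): centroid = element-wise mean of member features.
        for j in range(k):
            members = [f for f, lab in zip(feats, labels) if lab == j]
            if members:
                centroids[j] = [sum(vals) / len(members) for vals in zip(*members)]

    # Selection step: keep the member closest to each non-empty cluster's centroid.
    selected = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if idx:
            selected.append(min(idx, key=lambda i: measure(feats[i], centroids[j])))
    return sorted(selected), labels


# Toy data: features 0-2 are variants of sin(t), features 3-5 variants of cos(t).
t = [2 * math.pi * i / 200 for i in range(200)]
columns = [
    [math.sin(x) for x in t],
    [math.sin(x) + 0.1 * math.sin(3 * x) for x in t],
    [math.sin(x) + 0.1 * math.sin(5 * x) for x in t],
    [math.cos(x) for x in t],
    [math.cos(x) + 0.1 * math.cos(3 * x) for x in t],
    [math.cos(x) + 0.1 * math.cos(5 * x) for x in t],
]
selected, labels = select_features(columns, k=2)
print(selected, labels)  # one representative index per redundant group
```

On this toy input the six features collapse into two clusters of mutually redundant features, and one representative is kept from each. The `measure` parameter is where LSRE or MICI could be swapped in. For two unit-variance features, LSRE reduces to 1 - PCC² and MICI to 1 - |PCC| under their usual definitions, so the measures differ mainly when features are not rescaled. Because the mean of negatively correlated features can cancel out, a production version would likely need a more careful centroid definition than the one assumed here.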