An Unsupervised Feature Selection Method for Data-Driven Anomaly Detection Systems

2020 IEEE 29th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE). Pub Date: 2020-09-01. DOI: 10.1109/WETICE49692.2020.00016

N. Almusallam


Abstract

Feature selection is widely used as a pre-processing step to optimise the performance of data-driven intrusion/anomaly detection systems. For example, when data are grouped into normal and outlier groups, redundant and non-representative features reduce the accuracy of classifying the data points and increase the processing time. Feature selection is therefore applied before anomaly detection to optimise both classification accuracy and running time. Most existing feature selection methods have limitations when dealing with high-dimensional data, because they search different subsets of features to find accurate representations of all features. Searching over combinations of features is computationally very expensive, which makes existing methods inefficient for high-dimensional data. This work designs a similarity-based unsupervised feature selection method for efficient and accurate anomaly detection (UFSAD). It mainly tackles the selection of a reduced set of representative features from high-dimensional data without the data class labels. The selected features should improve the accuracy and performance of anomaly detection systems because redundant and non-representative features are eliminated. The proposed UFSAD method extends the k-means clustering algorithm to partition the features into k clusters based on a similarity measure, e.g. the Pearson Correlation Coefficient (PCC), the Least Square Regression Error (LSRE), or the Maximal Information Compression Index (MICI). A centroid-based feature selection step is then applied: in each cluster, the feature with the closest similarity to the cluster centroid is selected as the representative feature and the others are discarded. Extensive experimental work has shown that UFSAD can generate a reduced, representative, and non-redundant feature set that achieves good classification accuracy in comparison with well-known unsupervised feature selection methods.
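The two steps described in the abstract (cluster the features by similarity, then keep the feature closest to each centroid) can be illustrated with a minimal sketch. This is not the author's implementation, and several details are assumptions because the abstract does not specify them: the names `standardize` and `select_features` are hypothetical, the similarity measure is fixed to the absolute PCC, and each centroid is updated as the sign-aligned mean of its member features. NumPy is assumed to be available.

```python
import numpy as np


def standardize(F):
    """Center each row and scale it to unit length.

    After this step the dot product of two rows equals their PCC.
    """
    F = F - F.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows stay as zero vectors
    return F / norms


def select_features(X, k, n_iter=50, seed=0):
    """Cluster the columns of X into k groups and pick one feature per group.

    X has shape (n_samples, n_features). Returns (selected, labels), where
    selected is a sorted list of column indices and labels[i] is the cluster
    of feature i. Similarity is |PCC| (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    F = standardize(X.T)  # one row per feature
    n_features = F.shape[0]
    # Initialise centroids with k randomly chosen features.
    centroids = F[rng.choice(n_features, size=k, replace=False)].copy()
    labels = np.full(n_features, -1)

    for _ in range(n_iter):
        sim = F @ centroids.T  # signed PCC, shape (n_features, k)
        new_labels = np.abs(sim).argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = labels == j
            if not members.any():
                continue  # keep the old centroid for an empty cluster
            # Flip negatively correlated members so they do not cancel out.
            signs = np.sign(sim[members, j])
            signs[signs == 0] = 1.0
            centroids[j] = (signs[:, None] * F[members]).mean(axis=0)
        centroids = standardize(centroids)

    # Centroid-based selection: the most similar member represents its cluster.
    sim = np.abs(F @ centroids.T)
    selected = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        if members.size:
            selected.append(int(members[sim[members, j].argmax()]))
    return sorted(selected), labels


if __name__ == "__main__":
    demo_rng = np.random.default_rng(42)
    latent = demo_rng.normal(size=(300, 2))
    # Four features: two noisy copies of each latent variable.
    X_demo = np.hstack([latent, latent + 0.1 * demo_rng.normal(size=(300, 2))])
    print(select_features(X_demo, k=2))
```

Calling `select_features(X, k)` returns at most `k` column indices, one per non-empty cluster. The reduced matrix `X[:, selected]` can then be passed to an anomaly detector. Swapping in LSRE or MICI would mean replacing the `|PCC|` similarity in both the assignment and the selection step.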
