Biased-sampling of density-based local outlier detection algorithm

2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD). Pub Date: 2016-08-01. DOI: 10.1109/FSKD.2016.7603357

Peiguo Fu, Xiaohui Hu


Citations: 8

Abstract


Biased-sampling of density-based local outlier detection algorithm

Anomaly detection is an active research area in machine learning and data mining. Current outlier mining approaches based on distance or nearest neighbors take too long to run on high-dimensional, massive data sets. Many improvements have been proposed, but they still do not meet the demands of growing data volumes, and detection remains ineffective. This paper therefore presents a density-based anomaly detection algorithm that uses biased sampling. First, to avoid complex kernel function estimation and integration, we divide the data set into grid cells and use the number of data points in each cell as an approximate density. To reduce the cost of computing the resulting partitions, we use a hash table to map each grid cell to a hash table entry while counting its data points. We then roll up neighboring cells that have similar local density and compute the approximate density of the merged clusters. Next, we apply probability-based biased sampling to the data to be examined to obtain a subset, and then use a density-based local outlier detection method to compute the anomaly factor of each object in the subset. Because the subset is drawn by biased sampling, the anomaly factor serves as both a local and a global outlier factor; the higher a point's score, the more strongly it is an outlier. Experiments on a range of synthetic and real-life data sets confirm that, compared with previous related methods, our method achieves better accuracy, better scalability, and more efficient computation.
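The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions: the function names (`grid_cell`, `grid_counts`, `biased_sample`, `lof_scores`) are hypothetical, the sampling rule (keep a point with probability inversely proportional to its cell count, so that sparse regions are favored) is a guess because the abstract does not state the probability used, the roll-up of neighboring cells is omitted, and the scoring step assumes the density-based method is the standard local outlier factor (LOF), simplified here to exactly k nearest neighbors.

```python
import math
import random
from collections import Counter


def grid_cell(point, cell_width):
    """Return the integer grid coordinates of the cell containing `point`."""
    return tuple(math.floor(x / cell_width) for x in point)


def grid_counts(points, cell_width):
    """Count points per occupied cell; the Counter acts as the hash table."""
    return Counter(grid_cell(p, cell_width) for p in points)


def biased_sample(points, cell_width, target_size, seed=0):
    """Keep each point with probability proportional to 1 / (its cell count).

    Assumption: the inverse-count rule is illustrative only; the abstract
    does not state the sampling probability used in the paper.
    """
    if not points:
        return []
    counts = grid_counts(points, cell_width)
    weights = [1.0 / counts[grid_cell(p, cell_width)] for p in points]
    scale = target_size / sum(weights)  # expected sample size is about target_size
    rng = random.Random(seed)
    return [p for p, w in zip(points, weights)
            if rng.random() < min(1.0, scale * w)]


def lof_scores(points, k):
    """Simplified LOF with exactly k neighbors; requires k < len(points)."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        # Brute-force neighbor search: O(n^2), acceptable on a small sample.
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        neighbors.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])
    lrd = []  # local reachability density of each point
    for i in range(n):
        reach = sum(max(k_dist[j], math.dist(points[i], points[j]))
                    for j in neighbors[i])
        lrd.append(k / max(reach, 1e-12))  # guard against duplicate points
    # LOF: mean neighbor density divided by the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]
```

A typical call sequence under these assumptions would be `sample = biased_sample(points, cell_width, target_size)` followed by `lof_scores(sample, k)`, with scores well above 1 marking likely outliers. Scoring only the sample is what keeps the quadratic neighbor search affordable on large inputs.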

