Biased-sampling of density-based local outlier detection algorithm
Peiguo Fu, Xiaohui Hu
2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), 2016-08-01
DOI: 10.1109/FSKD.2016.7603357 (https://doi.org/10.1109/FSKD.2016.7603357)
Anomaly detection is an active research area in machine learning and data mining. Current outlier mining approaches based on distance or nearest neighbors take too long to run on high-dimensional, massive data. Many improvements have been proposed, but they still do not meet the demands of growing data volumes, and detection remains ineffective. This paper therefore presents a density-based anomaly detection algorithm built on biased sampling. First, to avoid complex kernel function estimation and integration, we divide the data set into grids and use the number of data points in each grid as an approximate density. To reduce the complexity of computing the divided clusters, we use a hash table to map each grid to a hash table unit while counting its data points. We then roll up neighboring grids that have similar local density and calculate the approximate density of the combined data clusters. Next, we apply a probability-based biased sampling method to the data to be examined to obtain a subset, and then use a density-based local outlier detection method to calculate the abnormal factor of each object in the subset. Because the subset is obtained by biased sampling, the abnormal factor serves as both a local outlier factor and a global outlier factor; the higher a point's score, the higher its degree of outlierness. Experiments on various artificial and real-life data sets confirm that, compared with previous related methods, our method achieves better accuracy, scalability, and computational efficiency.
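The grid-counting and biased-sampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the sampling probability, so the rule used here (keep a point with probability `min(1, base_rate / count**exponent)`, which favors sparse cells) is an assumption, as are the function names and the `cell_width`, `exponent`, and `base_rate` parameters. The roll-up of neighboring grids and the final outlier scoring are omitted.

```python
import random
from collections import Counter


def grid_cell(point, cell_width):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(int(x // cell_width) for x in point)


def grid_densities(points, cell_width):
    """Count points per occupied cell.

    Counter is a hash table keyed by cell coordinates, so only
    non-empty cells are stored, however many dimensions there are.
    """
    return Counter(grid_cell(p, cell_width) for p in points)


def biased_sample(points, cell_width, exponent=1.0, base_rate=1.0, seed=0):
    """Sample points with a bias toward sparse cells.

    ASSUMPTION: the keep probability min(1, base_rate / count**exponent)
    is a hypothetical choice for illustration; the paper's actual
    probability function is not stated in the abstract.
    """
    rng = random.Random(seed)
    counts = grid_densities(points, cell_width)
    sample = []
    for p in points:
        count = counts[grid_cell(p, cell_width)]
        keep_prob = min(1.0, base_rate / count ** exponent)
        if rng.random() < keep_prob:
            sample.append(p)
    return sample


# Toy data: a dense cluster in the unit square plus one isolated point.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]
points.append((10.0, 10.0))

sample = biased_sample(points, cell_width=1.0)
print(len(points), "points reduced to", len(sample))
```

Under this assumed rule, a point that is alone in its cell is always kept, while points in a crowded cell are thinned heavily, so the subset passed on for scoring is much smaller but still contains the isolated candidates. A density-based scorer such as scikit-learn's `LocalOutlierFactor` could then be run on `sample`; that pairing is likewise an illustration, not something the abstract specifies.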