Biased-Sampling Density-Based Local Outlier Detection Algorithm

Peiguo Fu, Xiaohui Hu
DOI: 10.1109/FSKD.2016.7603357
Published in: 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), August 2016
Citations: 8

Abstract

Anomaly detection is an active research topic in machine learning and data mining. Current outlier mining approaches based on distance or nearest neighbors take too long to run on high-dimensional, massive data. Many improvements have been proposed, but they still cannot keep pace with growing data volumes, and detection remains ineffective. This paper therefore presents a biased-sampling, density-based anomaly detection algorithm. First, to avoid complex kernel density estimation and integration, we partition the data set into grid cells and use the number of points in each cell as an approximate density. To reduce the cost of forming clusters, we map each cell to a hash-table slot while counting its points. We then roll up neighboring cells with locally similar densities and compute the approximate density of the merged clusters. Next, we apply probability-based biased sampling to draw a subset of the data to be examined, and use density-based local outlier detection to compute an outlier factor for each object in that subset. Because the sample is biased, this factor captures both local and global outlierness; once every object in the subset has a factor, the higher an object's score, the more anomalous it is. Experiments on various synthetic and real-world data sets confirm that, compared with previous related methods, our method achieves better accuracy, better scalability, and more efficient computation.
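As a rough illustration of the grid step described in the abstract, here is a minimal Python sketch. It is a hypothetical reconstruction, not the paper's implementation: the function names, the `cell_size` parameter, and the relative merge `tolerance` are our own assumptions. It hashes each point into a grid cell (a plain dict serves as the hash table), uses the per-cell point count as an approximate density, and greedily rolls up axis-adjacent cells with similar counts.

```python
import math
from collections import defaultdict

def grid_densities(points, cell_size=1.0):
    """Hash each point into a grid cell (dict as the hash table) and use the
    per-cell point count as an approximate density, avoiding kernel estimation."""
    counts = defaultdict(int)
    for p in points:
        cell = tuple(math.floor(x / cell_size) for x in p)  # integer cell index
        counts[cell] += 1
    return dict(counts)

def merge_similar_neighbors(counts, tolerance=0.5):
    """Greedily roll up axis-adjacent cells whose counts are within a relative
    tolerance of each other; each merged cluster keeps its combined count."""
    unvisited = set(counts)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, stack = [seed], [seed]
        while stack:
            cell = stack.pop()
            for dim in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if (nb in unvisited
                            and abs(counts[nb] - counts[cell])
                            <= tolerance * max(counts[cell], 1)):
                        unvisited.remove(nb)
                        cluster.append(nb)
                        stack.append(nb)
        clusters.append((cluster, sum(counts[c] for c in cluster)))
    return clusters
```

The dict lookup gives amortized O(1) access per cell, which is the point of the hash-table mapping: cluster formation never has to scan empty regions of the grid.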
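The sampling and scoring steps can be sketched in the same spirit. This is again a hedged illustration under our own assumptions (the inverse-density weighting, the `sample_frac` parameter, and the simplified LOF-style ratio are ours, not the paper's): points are drawn into the subset with probability proportional to the inverse of their approximate density, and each sampled object receives a factor comparing its mean k-NN distance to that of its neighbors, so higher scores indicate stronger outliers.

```python
import math
import random

def biased_sample(points, densities, sample_frac=0.4, seed=0):
    """Probability-based biased sampling: each point is weighted by the
    inverse of its approximate density, so sparse-region points (likely
    outliers) are over-represented in the sampled subset."""
    rng = random.Random(seed)
    weights = [1.0 / d for d in densities]
    total = sum(weights)
    k = max(1, int(sample_frac * len(points)))
    chosen = set()
    while len(chosen) < k:
        r, acc = rng.random() * total, 0.0
        for i, w in enumerate(weights):  # roulette-wheel draw
            acc += w
            if acc >= r:
                chosen.add(i)
                break
    return sorted(chosen)

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (brute force)."""
    order = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return [j for _, j in order[:k]]

def mean_knn_dist(points, i, k):
    return sum(math.dist(points[i], points[j]) for j in knn(points, i, k)) / k

def outlier_factor(points, i, k=2):
    """Simplified LOF-style factor: the point's mean k-NN distance divided by
    the average of its neighbors' mean k-NN distances. Scores well above 1
    mark a point noticeably sparser than its neighborhood."""
    nbrs = knn(points, i, k)
    return mean_knn_dist(points, i, k) / (sum(mean_knn_dist(points, j, k)
                                              for j in nbrs) / k)
```

Because sparse regions are over-sampled, an isolated point that would score highly against the full data set still scores highly within the subset, which is why the factor reflects both local and global outlierness.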