Outlier mining in high-dimensional data using the Jensen–Shannon divergence and graph structure analysis

Journal of Physics Complexity; IF 2.6; JCR Q1 (Mathematics, Interdisciplinary Applications); Pub Date: 2022-12-06; DOI: 10.1088/2632-072X/aca94a
Alex S O Toledo, Riccardo Silini, L. Carpi, C. Masoller
Citations: 1

Abstract

Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets in which it is possible to define a meaningful distance between elements of the dataset. The methodology is based on defining a fully connected, undirected graph, where the nodes are the elements of the dataset and the link weights are the distances between the nodes. Outlier scores are defined by analyzing the structure of the graph, in particular by using the Jensen–Shannon (JS) divergence to compare the distributions of weights of different nodes. We demonstrate the method using a publicly available database of credit-card transactions, where some of the transactions are labeled as fraudulent. We compare with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance but increases the computational cost.
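
The abstract does not give the exact definition of the outlier score, so the following Python sketch is only one plausible reading of the idea, not the authors' algorithm. It assumes Euclidean distances as link weights, a shared histogram binning (`n_bins`, a hypothetical parameter) to estimate each node's weight distribution, and a score equal to the JS divergence between a node's distribution and the average distribution over all nodes. The function names `js_divergence` and `outlier_scores` are illustrative.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # bins with a == 0 contribute nothing to the sum
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def outlier_scores(X, n_bins=20):
    """Score each row of X by how unusual its distribution of link weights is.

    Assumption (not from the paper): the reference distribution is the
    mean of all per-node histograms, and weights are Euclidean distances.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Weights of the fully connected, undirected graph: all pairwise distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Common bin edges so that all node histograms are comparable.
    edges = np.linspace(0.0, D.max(), n_bins + 1)
    off_diag = ~np.eye(n, dtype=bool)  # exclude self-distances (zeros)

    hists = np.empty((n, n_bins))
    for i in range(n):
        counts, _ = np.histogram(D[i, off_diag[i]], bins=edges)
        hists[i] = counts / counts.sum()

    reference = hists.mean(axis=0)
    return np.array([js_divergence(h, reference) for h in hists])
```

With this sketch, a point far from a dense cluster tends to receive the largest score, because its link weights concentrate in different bins from those of the other nodes. The full pairwise distance matrix costs O(n²) memory and time, which is in line with the abstract's remark that the JS-based approach increases the computational cost.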
Source journal: Journal of Physics Complexity (Computer Science, Information Systems)
CiteScore: 4.30
Self-citation rate: 11.10%
Articles published: 45
Review time: 14 weeks