Outlier mining in high-dimensional data using the Jensen–Shannon divergence and graph structure analysis

Journal of Physics Complexity | IF 2.6 | JCR Q1: Mathematics, Interdisciplinary Applications | Pub Date: 2022-12-06 | DOI: 10.1088/2632-072X/aca94a

Alex S O Toledo, Riccardo Silini, L. Carpi, C. Masoller

Citations: 1

Abstract

Reliable anomaly/outlier detection algorithms have practical applications in many fields. For instance, anomaly detection makes it possible to filter and clean the data used to train machine learning algorithms, improving their performance. However, outlier mining is challenging when the data is high-dimensional, and different approaches have been proposed for different types of data (temporal, spatial, network, etc.). Here we propose a methodology to mine outliers in generic datasets for which a meaningful distance between elements can be defined. The methodology is based on defining a fully connected, undirected graph, where the nodes are the elements of the dataset and the link weights are the distances between the nodes. Outlier scores are defined by analyzing the structure of the graph, in particular by using the Jensen–Shannon (JS) divergence to compare the distributions of weights of different nodes. We demonstrate the method using a publicly available database of credit-card transactions, in which some of the transactions are labeled as fraudulent. We compare the results with the performance obtained when using Euclidean distances and graph percolation, and show that the JS divergence improves performance but increases the computational cost.
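The graph construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the per-node weight distributions are estimated or how the JS divergences are combined into a score, so everything below is an assumption made for illustration: Euclidean distances as link weights, histograms on shared bins, and a score equal to the mean JS divergence between a node's weight distribution and those of all other nodes. The function names and the `n_bins` parameter are hypothetical, and the data is synthetic.

```python
import numpy as np


def pairwise_distances(X):
    """Link weights of the fully connected, undirected graph.

    Assumption: Euclidean distance; any meaningful distance could be used.
    """
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def weight_histograms(D, n_bins=20):
    """One normalized histogram of link weights per node, on shared bins."""
    n = D.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # ignore the zero self-distances
    edges = np.linspace(0.0, D[off_diag].max(), n_bins + 1)
    hists = np.empty((n, n_bins))
    for i in range(n):
        counts, _ = np.histogram(D[i, off_diag[i]], bins=edges)
        hists[i] = counts / counts.sum()
    return hists


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithm (between 0 and 1)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def outlier_scores(X, n_bins=20):
    """Hypothetical score: mean JS divergence to all other nodes.

    This aggregation is an assumption, not the definition used in the paper.
    """
    hists = weight_histograms(pairwise_distances(X), n_bins)
    n = len(hists)
    js = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            js[i, j] = js[j, i] = js_divergence(hists[i], hists[j])
    return js.sum(axis=1) / (n - 1)


# Synthetic data: 60 points in 10 dimensions, with point 0 shifted far away.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
X[0] += 8.0
scores = outlier_scores(X)
print(int(np.argmax(scores)))  # index of the highest-scoring element
```

Comparing every pair of nodes requires a number of JS evaluations that grows quadratically with the dataset size, on top of the distance matrix itself, which is consistent with the abstract's remark that the JS divergence increases the computational cost relative to using the distances alone.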