Data Mining and Knowledge Discovery

Negative-sample-free knowledge graph embedding
Pub Date : 2024-07-09 | DOI: 10.1007/s10618-024-01052-9
Adil Bahaj, Mounir Ghogho
Recently, knowledge graphs (KGs) have been shown to benefit many machine learning applications in multiple domains (e.g., self-driving, agriculture, bio-medicine, recommender systems). However, KGs suffer from incompleteness, which motivates the task of KG completion: inferring new (unobserved) links between existing entities based on observed links. This task is tackled with probabilistic, rule-based, or embedding-based approaches, of which the embedding-based approach has been shown to consistently outperform the others. It relies, however, on negative sampling, which supposes that every observed link is "true" and that every unobserved link is "false". Negative sampling increases the computational complexity of the learning process and introduces noise into learning. We propose NSF-KGE, a framework for KG embedding that does not require negative sampling, yet achieves performance comparable to that of negative-sampling-based approaches. NSF-KGE employs objectives from the non-contrastive self-supervised literature to learn representations that are invariant to relation transformations (e.g., translation, scaling, rotation) while avoiding representation collapse.
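The core recipe, an alignment term for observed triples plus an anti-collapse regulariser and no negatives at all, can be sketched in a few lines of numpy. The VICReg-style variance hinge below is my assumption for illustration; the paper's exact non-contrastive objective may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KG: entity embeddings plus one relation acting as a translation.
n_entities, dim = 5, 8
E = rng.normal(size=(n_entities, dim))   # entity embeddings
r = rng.normal(size=dim)                 # one relation embedding
triples = [(0, 1), (2, 3)]               # observed (head, tail) pairs under r

def invariance_loss(E, r, triples):
    # Pull the transformed head towards the tail: no negative samples needed.
    return np.mean([np.sum((E[h] + r - E[t]) ** 2) for h, t in triples])

def variance_loss(E, eps=1e-4):
    # Hinge on the per-dimension std: penalises collapse to a constant vector.
    std = np.sqrt(E.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def nsf_objective(E, r, triples, lam=1.0):
    return invariance_loss(E, r, triples) + lam * variance_loss(E)

loss = nsf_objective(E, r, triples)
```

A fully collapsed embedding matrix is heavily penalised by the variance term, which is what removes the need for explicit "false" links.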
Knowledge graph embedding closed under composition
Pub Date : 2024-07-04 | DOI: 10.1007/s10618-024-01050-x
Zhuoxun Zheng, Baifan Zhou, Hui Yang, Zhipeng Tan, Zequn Sun, Chunnong Li, Arild Waaler, Evgeny Kharlamov, Ahmet Soylu
Knowledge Graph Embedding (KGE) has attracted increasing attention. Relation patterns, such as symmetry and inversion, have received considerable focus. Among them, composition patterns are particularly important, as they involve nearly all relations in KGs. However, prior KGE approaches often consider relations to be compositional only if they are well-represented in the training data. Consequently, performance can degrade, especially for under-represented composition patterns. To this end, we propose HolmE, a general form of KGE whose relation embedding space is closed under composition, i.e., the composition of any two relation embeddings remains within the embedding space. This property ensures that every relation embedding can compose with, or be composed by, other relation embeddings. It enhances HolmE's capability to model under-represented (also called long-tail) composition patterns with limited learning instances. To the best of our knowledge, our work is the first to discuss KGE with this property of being closed under composition. We provide detailed theoretical proofs and extensive experiments to demonstrate the notable advantages of HolmE in modelling composition patterns, particularly long-tail patterns. Our results also highlight HolmE's effectiveness in extrapolating to unseen relations through composition and its state-of-the-art performance on benchmark datasets.
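The closure property is easy to see in its simplest instance: planar rotations, where composing any two relation embeddings (angles) yields another valid relation embedding. This toy sketch only illustrates closure; HolmE's actual construction is more general:

```python
import numpy as np

def rot(theta):
    # A relation embedded as a 2-D rotation; this family is closed under
    # composition, since rotation angles simply add.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

r1, r2 = rot(0.3), rot(1.1)
composed = r2 @ r1   # composing two relations stays inside the rotation family
```

Because the composed matrix is again a rotation, it can itself compose with any other relation embedding, which is exactly the property the paper exploits for long-tail composition patterns.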
Towards effective urban region-of-interest demand modeling via graph representation learning
Pub Date : 2024-07-03 | DOI: 10.1007/s10618-024-01049-4
Pu Wang, Jingya Sun, Wei Chen, Lei Zhao
Identifying a region's functionalities and the demand for specific Points-of-Interest (POIs) is essential for effective urban planning. However, due to the diverse and ambiguous nature of urban regions, some significant challenges in urban POI demand analysis remain unresolved. To this end, we propose a novel framework, Variational Multi-graph Auto-encoding Fusion, in which region-of-interest (ROI) demand modeling is enhanced through graph representation learning, aiming to effectively predict ROI demand at both the POI level and the category level. Specifically, we first divide the urban area into spatially differentiated neighborhood regions, extract the corresponding multi-dimensional attributes, and then generate the Spatial-Attributed Region Graph (SARG). After that, we introduce an unsupervised multi-graph-based variational auto-encoder to map regional profiles of the SARG into a latent space, and further retrieve dynamic latent representations through probabilistic sampling and global fusion. Additionally, during training, a spatio-temporally constrained Bayesian algorithm is adopted to infer destination POIs. Finally, extensive experiments conducted on a real-world dataset demonstrate that our model significantly outperforms state-of-the-art baselines.
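A minimal sketch of the first step, building a spatial-attributed region graph from gridded neighborhood regions. The grid layout, toy attribute vectors, and distance-based linking rule are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3x3 grid of neighborhood regions with toy attribute vectors
# (e.g., land-use shares or POI counts in a real pipeline).
coords = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
attrs = rng.random((len(coords), 4))

def build_sarg(coords, attrs, radius=1.0):
    """Adjacency links regions within `radius`; nodes carry attribute vectors."""
    n = len(coords)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(coords[i] - coords[j]) <= radius:
                A[i, j] = A[j, i] = 1.0
    return A, attrs

A, X = build_sarg(coords, attrs)
```

The resulting `(A, X)` pair is the kind of input a multi-graph variational auto-encoder would then map into latent space.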
RandomNet: clustering time series using untrained deep neural networks
Pub Date : 2024-06-22 | DOI: 10.1007/s10618-024-01048-5
Xiaosheng Li, Wenjie Xi, Jessica Lin
Neural networks are widely used in machine learning and data mining. Typically, these networks need to be trained, i.e., the weights (parameters) within the network are adjusted based on the input data. In this work, we propose a novel approach, RandomNet, that employs untrained deep neural networks to cluster time series. RandomNet uses different sets of random weights to extract diverse representations of time series and then ensembles the clustering relationships derived from these representations to build the final clustering result. By extracting diverse representations, the model can effectively handle time series with different characteristics. Since all parameters are randomly generated, no training is required. We provide a theoretical analysis of the effectiveness of the method. To validate its performance, we conduct extensive experiments on all 128 datasets in the well-known UCR time series archive and perform statistical analysis of the results. These datasets vary in size and sequence length and come from diverse fields. The experimental results show that the proposed method is competitive with existing state-of-the-art methods.
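The pipeline can be caricatured in numpy: random linear layers stand in for untrained networks, and a co-association matrix over per-view clusterings forms the ensemble. The tanh projections, inline k-means, and view count are my simplifications of the idea, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20, seed=0):
    # Minimal Lloyd's k-means, enough for this sketch.
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

# Two groups of toy series: low-frequency vs high-frequency sine waves.
t = np.linspace(0, 2 * np.pi, 64)
series = np.array([np.sin(f * t) + 0.05 * rng.normal(size=64)
                   for f in [1, 1, 1, 8, 8, 8]])

# Untrained "networks": random linear layers + tanh give diverse views;
# averaging per-view co-assignments gives the ensemble.
n_views, co = 10, np.zeros((6, 6))
for v in range(n_views):
    W = np.random.default_rng(v).normal(size=(64, 16))
    Z = np.tanh(series @ W)          # representation from random weights
    labels = kmeans(Z, 2, seed=v)
    co += labels[:, None] == labels[None, :]
co /= n_views
final = kmeans(co, 2, seed=0)        # cluster the co-association rows
```

No weight is ever trained; only the ensemble step reconciles the random views.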
Robust explainer recommendation for time series classification
Pub Date : 2024-06-20 | DOI: 10.1007/s10618-024-01045-8
Thu Trang Nguyen, Thach Le Nguyen, Georgiana Ifrim
Time series classification deals with temporal sequences, a prevalent data type in domains such as human activity recognition, sports analytics and general sensing. In this area, interest in explainability has been growing, as explanation is key to understanding the data and the model better. Recently, a great variety of techniques (e.g., LIME, SHAP, CAM) have been proposed and adapted for time series to provide explanations in the form of saliency maps, where the importance of each data point in the time series is quantified with a numerical value. However, the saliency maps can and often do disagree, so it is unclear which one to use. This paper provides a novel framework to quantitatively evaluate and rank explanation methods for time series classification. We show how to robustly evaluate the informativeness of a given explanation method (i.e., its relevance for the classification task), and how to compare explanations side-by-side. The goal is to recommend the best explainer for a given time series classification dataset. We propose AMEE, a Model-Agnostic Explanation Evaluation framework, for recommending saliency-based explanations for time series classification. In this approach, data perturbation is added to the input time series guided by each explanation. Our results show that perturbing discriminative parts of the time series leads to significant changes in classification accuracy, which can be used to evaluate each explanation. To be robust to different types of perturbations and classifiers, we aggregate the accuracy loss across perturbations and classifiers. This approach allows us to recommend the best explainer among a set of different explainers, including random and oracle explainers. We provide a quantitative and qualitative analysis for synthetic datasets, a variety of time-series datasets, and a real-world case study with known expert ground truth.
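The evaluation principle, perturb where an explainer says the signal is and watch accuracy fall, can be sketched with a 1-NN classifier on synthetic data. The zero-out perturbation and all names here are illustrative; AMEE itself aggregates over multiple perturbation types and classifiers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the classes differ only in the window t = 20..29.
def make_data(n):
    X = rng.normal(0.0, 0.2, size=(n, 50))
    y = rng.integers(0, 2, size=n)
    X[y == 1, 20:30] += 3.0
    return X, y

Xtr, ytr = make_data(40)
Xte, yte = make_data(40)

def one_nn_acc(Xtr, ytr, Xte, yte):
    # 1-nearest-neighbour accuracy as a stand-in classifier.
    d = ((Xte[:, None] - Xtr[None]) ** 2).sum(-1)
    return float((ytr[d.argmin(axis=1)] == yte).mean())

def acc_after_perturbation(saliency, frac=0.2):
    # Zero out the top-`frac` most salient timesteps, then re-evaluate.
    k = int(frac * Xte.shape[1])
    idx = np.argsort(saliency)[-k:]
    Xp = Xte.copy()
    Xp[:, idx] = 0.0
    return one_nn_acc(Xtr, ytr, Xp, yte)

# An informative explainer marks the true window; a random one mostly misses it.
informative = np.zeros(50)
informative[20:30] = 1.0
random_saliency = rng.random(50)
acc_informative = acc_after_perturbation(informative)
acc_random = acc_after_perturbation(random_saliency)
```

Perturbing where the informative explainer points destroys accuracy, while perturbing at random barely hurts it; ranking explainers by that accuracy loss is the core of the framework.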
Series2Vec: similarity-based self-supervised representation learning for time series classification
Pub Date : 2024-06-20 | DOI: 10.1007/s10618-024-01043-w
Navid Mohammadi Foumani, Chang Wei Tan, Geoffrey I. Webb, Hamid Rezatofighi, Mahsa Salehi
We argue that time series analysis is fundamentally different from vision and natural language processing with respect to the forms of meaningful self-supervised learning tasks that can be defined. Motivated by this insight, we introduce a novel approach called Series2Vec for self-supervised representation learning. Unlike state-of-the-art methods in time series, which rely on hand-crafted data augmentation, Series2Vec is trained to predict the similarity between two series in both the temporal and spectral domains through a self-supervised task. By leveraging the similarity prediction task, which has inherent meaning for a wide range of time series analysis tasks, Series2Vec eliminates the need for hand-crafted data augmentation. To further encourage the network to learn similar representations for similar time series, we propose a novel approach that applies order-invariant attention to each representation within the batch during training. Our evaluation of Series2Vec on nine large real-world datasets, along with the UCR/UEA archive, shows enhanced performance compared to current state-of-the-art self-supervised techniques for time series. Additionally, our extensive experiments show that Series2Vec performs comparably to fully supervised training and offers high efficiency on datasets with limited labeled data. Finally, we show that fusing Series2Vec with other representation learning models leads to enhanced performance for time series classification. Code and models are open-source at https://github.com/Navidfoumani/Series2Vec
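The self-supervised targets, pairwise similarity in the temporal and spectral domains, can be computed directly from a batch; training an encoder to predict them is then the learning task. The negative-squared-distance form below is my simplification of the paper's similarity measure:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))   # a batch of toy series

def similarity_targets(X):
    """Pairwise self-supervised targets in the temporal and spectral domains."""
    def neg_sq_dist(Z):
        return -((Z[:, None] - Z[None]) ** 2).sum(-1)
    spectral = np.abs(np.fft.rfft(X, axis=1))   # magnitude spectrum per series
    return neg_sq_dist(X), neg_sq_dist(spectral)

S_time, S_spec = similarity_targets(X)
```

Since the targets come from the data itself, no hand-crafted augmentation is needed to create training signal.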
GeoRF: a geospatial random forest
Pub Date : 2024-06-19 | DOI: 10.1007/s10618-024-01046-7
Margot Geerts, Seppe vanden Broucke, Jochen De Weerdt
The geospatial domain increasingly relies on data-driven methodologies to extract actionable insights from the growing volume of available data. Despite the effectiveness of tree-based models in capturing complex relationships between features and targets, they fall short when it comes to spatial factors. This limitation arises from their reliance on univariate, axis-parallel splits, which result in rectangular areas on a map. To address this issue and enhance both performance and interpretability, we propose a solution that introduces two novel bivariate splits: an oblique split and a Gaussian split designed specifically for geographic coordinates. Our innovation, called Geospatial Random Forest (geoRF), builds upon Geospatial Regression Trees (GeoTrees) to effectively incorporate geographic features and extract maximal spatial insight. Through an extensive benchmark, we show that geoRF outperforms traditional spatial statistical models, other spatial random forest variants, and machine learning and deep learning methods across a range of geospatial tasks. Furthermore, we contextualize our method's computational time complexity relative to baseline approaches. Our prediction maps illustrate that geoRF produces more robust and intuitive decision boundaries than conventional tree-based models. Using impurity-based feature importance measures, we validate geoRF's effectiveness in highlighting the significance of geographic coordinates, especially on datasets exhibiting pronounced spatial patterns.
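Why a bivariate oblique split helps is easy to demonstrate: on a diagonal boundary, one oblique cut achieves what many axis-parallel cuts can only approximate. A toy sum-of-squared-errors comparison (my construction, not the paper's benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target that changes across the diagonal lat + lon = 1.
pts = rng.random((200, 2))                       # (lat, lon) in the unit square
y = np.where(pts.sum(axis=1) > 1.0, 5.0, 0.0) + rng.normal(0, 0.1, size=200)

def split_sse(mask, y):
    # Sum of squared errors after splitting the node into mask / ~mask.
    sse = 0.0
    for side in (mask, ~mask):
        if side.any():
            sse += ((y[side] - y[side].mean()) ** 2).sum()
    return sse

# Best univariate, axis-parallel split on either coordinate...
axis_best = min(split_sse(pts[:, f] <= t, y)
                for f in (0, 1) for t in np.linspace(0.05, 0.95, 19))
# ...versus a single bivariate oblique split along the true boundary.
oblique_sse = split_sse(pts @ np.ones(2) <= 1.0, y)
```

A single oblique split isolates the diagonal pattern almost perfectly, while every axis-parallel threshold leaves mixed rectangles, which is the motivation for geoRF's coordinate-aware splits.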
Modelling event sequence data by type-wise neural point process
Pub Date : 2024-06-17 | DOI: 10.1007/s10618-024-01047-6
Bingqing Liu
Event sequence data is ubiquitous in real life, where each event is typically represented as a tuple of event type and occurrence time. Recently, the neural point process (NPP), a probabilistic model that learns the distribution of the next event given the event history, has gained much attention for event sequence modelling. Existing NPP models use a single vector to encode the entire event history. However, each type of event has its own historical events of concern, which calls for a type-specific encoding of the history. To this end, we propose the Type-wise Neural Point Process (TNPP), in which each event type has a history vector that encodes the historical events of its own interest. Type-wise encoding further enables type-wise decoding, which together yield a more effective neural point process. Experimental results on six datasets show that TNPP outperforms existing models on the event type prediction task under both extrapolation and interpolation settings. Moreover, results on scalability and interpretability show that TNPP scales well to datasets with many event types and can provide high-quality event dependencies for interpretation. The code and data can be found at https://github.com/lbq8942/TNPP.
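One way to picture type-wise encoding: each event type keeps its own decayed history vector and weights every incoming event by how relevant that event's type is to it. The exponential decay and the random relevance weights are illustrative assumptions, not TNPP's actual parameterisation:

```python
import numpy as np

def typewise_histories(events, n_types, dim=4, decay=0.5, seed=0):
    """One history vector per event type, each attending to events its own way."""
    rng = np.random.default_rng(seed)
    type_emb = rng.normal(size=(n_types, dim))    # an embedding per event type
    relevance = rng.random((n_types, n_types))    # how much type k attends to type j
    H = np.zeros((n_types, dim))                  # one history vector per type
    prev_t = 0.0
    for t, j in events:                           # (time, type) pairs, time-ordered
        H *= np.exp(-decay * (t - prev_t))        # time-decay every history
        H += relevance[:, j, None] * type_emb[j]  # each type weights the new event
        prev_t = t
    return H

events = [(0.5, 0), (1.0, 1), (2.0, 0)]
H = typewise_histories(events, n_types=3)
```

Each row of `H` is then the history encoding a type-wise decoder would consume, instead of one shared vector for all types.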
The impact of variable ordering on Bayesian network structure learning
Pub Date : 2024-06-08 | DOI: 10.1007/s10618-024-01044-9
Neville K. Kitson, Anthony C. Constantinou
Causal Bayesian Networks (CBNs) provide an important tool for reasoning under uncertainty, with potential application to many complex causal systems. Structure learning algorithms that can tell us something about the causal structure of these systems are becoming increasingly important. In the literature, the validity of these algorithms is often tested for sensitivity to varying sample sizes, hyper-parameters, and occasionally objective functions, but the effect of the order in which the variables are read from data is rarely quantified. We show that many commonly used algorithms, both established and state-of-the-art, are more sensitive to variable ordering than to these other factors when learning CBNs from discrete variables. This effect is strongest in hill-climbing and its variants, where we explain how it arises, but it extends to hybrid and, to a lesser extent, constraint-based algorithms. Because the variable ordering is arbitrary, any significant effect it has on learnt graph accuracy is concerning, and it raises questions about the validity of both older and more recent results produced by these algorithms in practical applications, as well as their rankings in performance evaluations.
{"title":"The impact of variable ordering on Bayesian network structure learning","authors":"Neville K. Kitson, Anthony C. Constantinou","doi":"10.1007/s10618-024-01044-9","DOIUrl":"https://doi.org/10.1007/s10618-024-01044-9","url":null,"abstract":"<p>Causal Bayesian Networks (CBNs) provide an important tool for reasoning under uncertainty with potential application to many complex causal systems. Structure learning algorithms that can tell us something about the causal structure of these systems are becoming increasingly important. In the literature, the validity of these algorithms is often tested for sensitivity over varying sample sizes, hyper-parameters, and occasionally objective functions, but the effect of the order in which the variables are read from data is rarely quantified. We show that many commonly-used algorithms, both established and state-of-the-art, are more sensitive to variable ordering than these other factors when learning CBNs from discrete variables. This effect is strongest in hill-climbing and its variants where we explain how it arises, but extends to hybrid, and to a lesser-extent, constraint-based algorithms. 
Because the variable ordering is arbitrary, any significant effect it has on learnt graph accuracy is concerning, and raises questions about the validity of both many older and more recent results produced by these algorithms in practical applications and their rankings in performance evaluations.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141503031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
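The ordering effect described in the abstract is easy to reproduce in miniature. The sketch below is not any of the paper's benchmarked learners; it is a hypothetical first-improvement hill climber over single-edge additions, where candidate edges are scanned in the order induced by the variable list, so ties between equally scoring edges are broken by that arbitrary order:

```python
def greedy_edges(variables, score):
    """First-improvement hill climbing over single-edge additions
    (a toy, not a full BN structure learner). Candidates are
    scanned in the order induced by `variables`, so score ties are
    resolved by variable order.
    """
    edges = set()
    improved = True
    while improved:
        improved = False
        for a in variables:
            for b in variables:
                if a != b and (a, b) not in edges and (b, a) not in edges:
                    if score(edges | {(a, b)}) > score(edges):
                        edges.add((a, b))
                        improved = True
    return edges

# A score indifferent to edge direction (as many scores are within
# an equivalence class): it rewards having one edge, either way.
score = lambda es: min(len(es), 1)

g1 = greedy_edges(["X", "Y"], score)  # reads variables as X, Y
g2 = greedy_edges(["Y", "X"], score)  # same data order reversed
```

Here `g1` and `g2` contain the same undirected skeleton but differently oriented edges, purely because of the read order — the kind of arbitrary effect the paper quantifies at scale.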
Pub Date : 2024-06-04DOI: 10.1007/s10618-024-01042-x
Jinping Hu, Evert de Haan, Bernd Skiera
Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of a treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), each of which uses one of four specially designed quasi-loss functions for uplift estimation. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.
{"title":"Uplift modeling with quasi-loss-functions","authors":"Jinping Hu, Evert de Haan, Bernd Skiera","doi":"10.1007/s10618-024-01042-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01042-x","url":null,"abstract":"<p>Uplift modeling, also referred to as heterogeneous treatment effect estimation, is a machine learning technique utilized in marketing for estimating the incremental impact of treatment on the response of each customer. Uplift models face a fundamental challenge in causal inference because the variable of interest (i.e., the uplift itself) remains unobservable. As a result, popular uplift models (such as meta-learners and uplift trees) do not incorporate loss functions for uplifts in their algorithms. This article addresses that gap by proposing uplift models with quasi-loss functions (UpliftQL models), which separately use four specially designed quasi-loss functions for uplift estimation in algorithms. Using simulated data, our analysis reveals that, on average, 55% (34%) of the top five models from a set of 14 are UpliftQL models for binary (continuous) outcomes. Further empirical data analysis shows that over 60% of the top-performing models are consistently UpliftQL models.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"23 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}