Pub Date: 2025-01-01; Epub Date: 2025-01-22; DOI: 10.1007/s10618-024-01074-3
Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering
Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) confidential information in a dataset that has been deliberately deleted for privacy protection. In order to analyze such datasets, it is often important to replace each missing value with one or more valid letters in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
{"title":"Missing value replacement in strings and applications.","authors":"Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering","doi":"10.1007/s10618-024-01074-3","DOIUrl":"10.1007/s10618-024-01074-3","url":null,"abstract":"<p><p>Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more <i>valid</i> letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the <i>context</i> of the missing value (i.e., its vicinity) as well as a finite set of user-defined <i>forbidden</i> patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to <i>minimize the number of new letters</i> we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain <i>forbidden edges</i> representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to <i>fully</i> sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"39 2","pages":"12"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11754389/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143048707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-14; DOI: 10.1007/s10618-024-01068-1
Joscha Diehl, Richard Krieg
We introduce a pipeline for time series classification that extracts features based on the iterated-sums signature (ISS) and then applies a linear classifier. These features are intrinsically nonlinear, capture chronological information, and, under certain settings, are invariant to a form of time-warping. We achieve competitive results, both in accuracy and speed, on the UCR archive. We make our code available at https://github.com/irkri/fruits.
{"title":"FRUITS: feature extraction using iterated sums for time series classification","authors":"Joscha Diehl, Richard Krieg","doi":"10.1007/s10618-024-01068-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01068-1","url":null,"abstract":"<p>We introduce a pipeline for time series classification that extracts features based on the iterated-sums signature (ISS) and then applies a linear classifier. These features are intrinsically nonlinear, capture chronological information, and, under certain settings, are invariant to a form of time-warping. We achieve competitive results, both in accuracy and speed, on the UCR archive. We make our code available at https://github.com/irkri/fruits.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"32 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-09; DOI: 10.1007/s10618-024-01069-0
Dario Simionato, Fabio Vandin
Many algorithms have been proposed to learn local graphical structures around target variables of interest from observational data, focusing on two sets of variables. The first one, called the Parent–Children (PC) set, contains all the variables that are direct causes or consequences of the target, while the second one, known as the Markov boundary (MB), is the minimal set of variables with optimal prediction performance for the target. In this paper we introduce two novel algorithms for the PC and MB discovery tasks with rigorous guarantees on the Family-Wise Error Rate (FWER), that is, the probability of reporting any false positive in the output. Our algorithms use Rademacher averages, a key concept from statistical learning theory, to properly account for the multiple-hypothesis testing problem arising in such tasks. Our evaluation on simulated data shows that our algorithms properly control for the FWER, while widely used algorithms do not provide guarantees on false discoveries even when correcting for multiple-hypothesis testing. Our experiments also show that our algorithms identify meaningful relations in real-world data.
{"title":"Bounding the family-wise error rate in local causal discovery using Rademacher averages","authors":"Dario Simionato, Fabio Vandin","doi":"10.1007/s10618-024-01069-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01069-0","url":null,"abstract":"<p>Many algorithms have been proposed to learn local graphical structures around target variables of interest from observational data, focusing on two sets of variables. The first one, called Parent–Children (PC) set, contains all the variables that are direct causes or consequences of the target while the second one, known as Markov boundary (MB), is the minimal set of variables with optimal prediction performances of the target. In this paper we introduce two novel algorithms for the PC and MB discovery tasks with rigorous guarantees on the Family-Wise Error Rate (FWER), that is, the probability of reporting any false positive in output. Our algorithms use Rademacher averages, a key concept from statistical learning theory, to properly account for the multiple-hypothesis testing problem arising in such tasks. Our evaluation on simulated data shows that our algorithms properly control for the FWER, while widely used algorithms do not provide guarantees on false discoveries even when correcting for multiple-hypothesis testing. Our experiments also show that our algorithms identify meaningful relations in real-world data.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"10 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-03; DOI: 10.1007/s10618-024-01066-3
Benet Manzanares-Salor, David Sánchez, Pierre Lison
The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making them available for secondary use. A variety of text anonymization methods have been proposed in recent years, which are standardly evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that the influence of some missed terms on the disclosure risk may be much larger than that of others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
{"title":"Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack","authors":"Benet Manzanares-Salor, David Sánchez, Pierre Lison","doi":"10.1007/s10618-024-01066-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01066-3","url":null,"abstract":"<p>The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior making them available for secondary use. A variety of text anonymization methods have been proposed in the last years, which are standardly evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that the influence on the disclosure risk of some missed terms may be much larger than of others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"8 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-01; DOI: 10.1007/s10618-024-01063-6
Zhanliang Huang, Ata Kabán, Henry Reeve
High dimensional learning is a perennial problem due to challenges posed by the “curse of dimensionality”; learning typically demands more computing resources as well as more training data. In differentially private (DP) settings, this is further exacerbated by the noise that needs to be added to each dimension to achieve the required privacy. In this paper, we present a surprisingly simple approach to address all of these concerns at once, based on histograms constructed on a low-dimensional random projection (RP) of the data. Our approach exploits RP to take advantage of hidden low-dimensional structures in the data, yielding both computational efficiency and improved error convergence with respect to the sample size—whereby less training data suffice for learning. We also propose a variant for efficient differentially private (DP) classification that further exploits the data-oblivious nature of both the histogram construction and the RP based dimensionality reduction, resulting in an efficient management of the privacy budget. We present a detailed and rigorous theoretical analysis of generalisation of our algorithms in several settings, showing that our approach is able to exploit low-dimensional structure of the data, ameliorates the ill-effects of noise required for privacy, and has good generalisation under minimal conditions. We also corroborate our findings experimentally, and demonstrate that our algorithms achieve competitive classification accuracy in both non-private and private settings.
{"title":"Efficient learning with projected histograms","authors":"Zhanliang Huang, Ata Kabán, Henry Reeve","doi":"10.1007/s10618-024-01063-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01063-6","url":null,"abstract":"<p>High dimensional learning is a perennial problem due to challenges posed by the “curse of dimensionality”; learning typically demands more computing resources as well as more training data. In differentially private (DP) settings, this is further exacerbated by noise that needs adding to each dimension to achieve the required privacy. In this paper, we present a surprisingly simple approach to address all of these concerns at once, based on histograms constructed on a low-dimensional random projection (RP) of the data. Our approach exploits RP to take advantage of hidden low-dimensional structures in the data, yielding both computational efficiency, and improved error convergence with respect to the sample size—whereby less training data suffice for learning. We also propose a variant for efficient differentially private (DP) classification that further exploits the data-oblivious nature of both the histogram construction and the RP based dimensionality reduction, resulting in an efficient management of the privacy budget. We present a detailed and rigorous theoretical analysis of generalisation of our algorithms in several settings, showing that our approach is able to exploit low-dimensional structure of the data, ameliorates the ill-effects of noise required for privacy, and has good generalisation under minimal conditions. We also corroborate our findings experimentally, and demonstrate that our algorithms achieve competitive classification accuracy in both non-private and private settings.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30; DOI: 10.1007/s10618-024-01064-5
Zuobai Zhang, Wanyue Xu, Zhongzhi Zhang, Guanrong Chen
The issue of opinion sharing and formation has received considerable attention in the academic literature, and a few models have been proposed to study this problem. However, existing models are limited to interactions among nearest neighbors, with second-, third-, and higher-order neighbors only considered indirectly, despite the fact that higher-order interactions occur frequently in real social networks. In this paper, we develop a new model for opinion dynamics by incorporating long-range interactions based on higher-order random walks, which can explicitly tune the degree of influence of higher-order neighbor interactions. We prove that the model converges to a fixed opinion vector, which may differ greatly from that of models without higher-order interactions. Since direct computation of the equilibrium opinion is expensive, involving huge-scale matrix multiplication and inversion, we design a theoretically convergence-guaranteed estimation algorithm that approximates the equilibrium opinion vector nearly linearly in both space and time with respect to the number of edges in the graph. We conduct extensive experiments on various social networks, demonstrating that the new algorithm is both highly efficient and effective.
{"title":"Opinion dynamics in social networks incorporating higher-order interactions","authors":"Zuobai Zhang, Wanyue Xu, Zhongzhi Zhang, Guanrong Chen","doi":"10.1007/s10618-024-01064-5","DOIUrl":"https://doi.org/10.1007/s10618-024-01064-5","url":null,"abstract":"<p>The issue of opinion sharing and formation has received considerable attention in the academic literature, and a few models have been proposed to study this problem. However, existing models are limited to the interactions among nearest neighbors, with those second, third, and higher-order neighbors only considered indirectly, despite the fact that higher-order interactions occur frequently in real social networks. In this paper, we develop a new model for opinion dynamics by incorporating long-range interactions based on higher-order random walks that can explicitly tune the degree of influence of higher-order neighbor interactions. We prove that the model converges to a fixed opinion vector, which may differ greatly from those models without higher-order interactions. Since direct computation of the equilibrium opinion is computationally expensive, which involves the operations of huge-scale matrix multiplication and inversion, we design a theoretically convergence-guaranteed estimation algorithm that approximates the equilibrium opinion vector nearly linearly in both space and time with respect to the number of edges in the graph. We conduct extensive experiments on various social networks, demonstrating that the new algorithm is both highly efficient and effective.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"138 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-29; DOI: 10.1007/s10618-024-01070-7
Danny Wood, Theodore Papamarkou, Matt Benatan, Richard Allmendinger
In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the reasons for the model’s level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model’s predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches to understand both the sources of uncertainty and their impact on model performance.
{"title":"Model-agnostic variable importance for predictive uncertainty: an entropy-based approach","authors":"Danny Wood, Theodore Papamarkou, Matt Benatan, Richard Allmendinger","doi":"10.1007/s10618-024-01070-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01070-7","url":null,"abstract":"<p>In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the reasons for the model’s level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model’s predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches to understand both the sources of uncertainty and their impact on model performance.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-20; DOI: 10.1007/s10618-024-01062-7
Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén
Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6% while reducing features by 98.9%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at https://github.com/gon-uri/detach_rocket.
{"title":"Detach-ROCKET: sequential feature selection for time series classification with random convolutional kernels","authors":"Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén","doi":"10.1007/s10618-024-01062-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01062-7","url":null,"abstract":"<p>Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6% while reducing features by 98.9%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at https://github.com/gon-uri/detach_rocket.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-17; DOI: 10.1007/s10618-024-01054-7
Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver
Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.
{"title":"Bayesian network Motifs for reasoning over heterogeneous unlinked datasets","authors":"Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver","doi":"10.1007/s10618-024-01054-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01054-7","url":null,"abstract":"<p>Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"125 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-15; DOI: 10.1007/s10618-024-01067-2
Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz
Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We make three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others’ developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
{"title":"Regularization-based methods for ordinal quantification","authors":"Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz","doi":"10.1007/s10618-024-01067-2","DOIUrl":"https://doi.org/10.1007/s10618-024-01067-2","url":null,"abstract":"<p>Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of <span>(n>2)</span> classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others’ developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"75 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}