
Latest Articles in Data Mining and Knowledge Discovery

Missing value replacement in strings and applications.
IF 2.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-01 | Epub Date: 2025-01-22 | DOI: 10.1007/s10618-024-01074-3
Giulia Bernardini, Chang Liu, Grigorios Loukides, Alberto Marchetti-Spaccamela, Solon P Pissis, Leen Stougie, Michelle Sweering

Missing values arise routinely in real-world sequential (string) datasets due to: (1) imprecise data measurements; (2) flexible sequence modeling, such as binding profiles of molecular sequences; or (3) confidential information that has been deliberately deleted from a dataset for privacy protection. In order to analyze such datasets, it is often important to replace each missing value with one or more valid letters in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length forbidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective, efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.
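To make the objective and constraints concrete, here is a toy brute-force version of the task, not the paper's linear-time shortest-path algorithm: given the left and right context of a single gap, it searches for a shortest fill that avoids every forbidden pattern. The function name, the length cap, and the example patterns are assumptions for illustration.

```python
from itertools import product

def shortest_fill(left, right, alphabet, forbidden, max_len=4):
    """Toy baseline: return a shortest string `fill` such that
    left + fill + right contains no forbidden pattern, or None if no
    fill of length <= max_len exists. Trying shorter fills first
    mirrors the objective of minimizing the number of new letters."""
    def valid(text):
        return not any(p in text for p in forbidden)
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            fill = "".join(letters)
            if valid(left + fill + right):
                return fill
    return None

# Fill the gap in "AC#GT" while forbidding the patterns "CG" and "CAG".
print(shortest_fill("AC", "GT", "ACGT", {"CG", "CAG"}))  # -> "T"
```

The paper's shortest-path formulation avoids this exponential enumeration; the sketch only illustrates the problem setup.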

Citations: 0
Detection and evaluation of clusters within sequential data.
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-01 | Epub Date: 2025-08-14 | DOI: 10.1007/s10618-025-01140-4
Alexander Van Werde, Albert Senen-Cerda, Gianluca Kosmella, Jaron Sanders

Sequential data is ubiquitous: it is routinely gathered to gain insights into complex processes such as behavioral, biological, or physical processes. Challengingly, such data not only has dependencies within the observed sequences, but the observations are also often high-dimensional, sparse, and noisy. These difficulties all obscure the inner workings of the complex process under study. One solution is to calculate a low-dimensional representation that describes (characteristics of) the complex process. This representation can then serve as a proxy to gain insight into the original process. However, uncovering such a low-dimensional representation within sequential data is nontrivial due to the dependencies, and an algorithm specifically made for sequences is needed to guarantee estimator consistency. Fortunately, recent theoretical advancements on Block Markov Chains have resulted in new clustering algorithms that can provably do just this in synthetic sequential data. This paper presents a first field study of these new algorithms in real-world sequential data: a wide empirical study of clustering across a range of data sequences. We investigate broadly whether, when given sparse high-dimensional sequential data from real-life complex processes, useful low-dimensional representations can in fact be extracted using these algorithms. Concretely, we examine data sequences containing GPS coordinates describing animal movement, strands of human DNA, texts from English writing, and daily yields in a financial market. The low-dimensional representations we uncover are shown not only to successfully encode the sequential structure of the data, but also to enable new insights into the underlying complex processes.

Supplementary information: The online version contains supplementary material available at 10.1007/s10618-025-01140-4.
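For intuition, the sketch below clusters the states of a single observed sequence by spectrally embedding its empirical transition-frequency matrix, which is the rough shape of Block-Markov-Chain clustering; the SVD embedding, the k-means step, and the synthetic two-block chain are assumptions for illustration, not the algorithms evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_states(seq, n_states, n_clusters, seed=0):
    """Cluster states of a symbol sequence from its empirical transition
    frequencies: rank-k SVD embedding followed by k-means (a toy stand-in
    for spectral clustering of Block Markov Chains)."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    freq = counts / max(1, len(seq) - 1)
    u, _, vt = np.linalg.svd(freq)
    emb = np.hstack([u[:, :n_clusters], vt[:n_clusters].T])  # out- and in-behavior
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(emb)

# Synthetic chain over 6 states with two hidden blocks {0,1,2} and {3,4,5}:
# the walk stays inside its current block with probability 0.9.
rng = np.random.default_rng(0)
seq, s = [], 0
for _ in range(20000):
    block = s // 3 if rng.random() < 0.9 else 1 - s // 3
    s = int(3 * block + rng.integers(3))
    seq.append(s)
print(cluster_states(seq, n_states=6, n_clusters=2))  # e.g. [0 0 0 1 1 1]
```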

Citations: 0
FRUITS: feature extraction using iterated sums for time series classification
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-14 | DOI: 10.1007/s10618-024-01068-1
Joscha Diehl, Richard Krieg

We introduce a pipeline for time series classification that extracts features based on the iterated-sums signature (ISS) and then applies a linear classifier. These features are intrinsically nonlinear, capture chronological information, and, under certain settings, are invariant to a form of time-warping. We achieve competitive results, both in accuracy and speed, on the UCR archive. We make our code available at https://github.com/irkri/fruits.
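As a flavor of the feature map, the snippet below computes a few low-order iterated sums of a one-dimensional series from its increments via nested cumulative sums; the specific words [1], [1][1], and [2] are chosen for illustration only, and the released FRUITS package at the linked repository implements the full, configurable pipeline.

```python
import numpy as np

def iss_features(x):
    """A few low-order iterated-sums-signature entries of a 1-D series,
    computed from its increments dx via nested cumulative sums."""
    dx = np.diff(x)
    s1 = np.cumsum(dx)                        # <[1]>    : sum_i dx_i
    prev = np.concatenate(([0.0], s1[:-1]))   # strictly earlier mass
    s11 = np.cumsum(prev * dx)                # <[1][1]> : sum_{i<j} dx_i dx_j
    s2 = np.cumsum(dx ** 2)                   # <[2]>    : sum_i dx_i^2
    return np.array([s1[-1], s11[-1], s2[-1]])

x = np.sin(np.linspace(0.0, 6.0, 100))
print(iss_features(x))
```

Stacking such features for every training series and fitting a linear model (for instance scikit-learn's RidgeClassifierCV, as in ROCKET-style pipelines) completes the classification pipeline.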

Citations: 0
Bounding the family-wise error rate in local causal discovery using Rademacher averages
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-09 | DOI: 10.1007/s10618-024-01069-0
Dario Simionato, Fabio Vandin

Many algorithms have been proposed to learn local graphical structures around target variables of interest from observational data, focusing on two sets of variables. The first one, called the Parent–Children (PC) set, contains all the variables that are direct causes or consequences of the target, while the second one, known as the Markov boundary (MB), is the minimal set of variables with optimal prediction performance for the target. In this paper we introduce two novel algorithms for the PC and MB discovery tasks with rigorous guarantees on the Family-Wise Error Rate (FWER), that is, the probability of reporting any false positive in the output. Our algorithms use Rademacher averages, a key concept from statistical learning theory, to properly account for the multiple-hypothesis testing problem arising in such tasks. Our evaluation on simulated data shows that our algorithms properly control the FWER, while widely used algorithms do not provide guarantees on false discoveries even when correcting for multiple-hypothesis testing. Our experiments also show that our algorithms identify meaningful relations in real-world data.
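The sketch below shows only the generic ingredient, a Monte-Carlo estimate of an empirical Rademacher average for a finite family of statistics, not the paper's PC/MB discovery algorithms; the array shapes and names are assumptions.

```python
import numpy as np

def empirical_rademacher(values, n_draws=1000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher average of a finite
    function family, where values[i, h] = f_h(sample_i). The quantity bounds
    the maximum deviation between empirical and true means simultaneously
    over all h, which is what enables FWER-style guarantees over many
    hypotheses at once."""
    rng = np.random.default_rng(seed)
    n = values.shape[0]
    sups = np.empty(n_draws)
    for t in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # random signs
        sups[t] = np.max(sigma @ values) / n
    return sups.mean()

stats = np.random.default_rng(1).normal(size=(500, 50))  # 500 samples, 50 statistics
print(empirical_rademacher(stats))
```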

Citations: 0
Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-03 | DOI: 10.1007/s10618-024-01066-3
Benet Manzanares-Salor, David Sánchez, Pierre Lison

The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making it available for secondary use. A variety of text anonymization methods have been proposed in recent years, which are standardly evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that some missed terms may influence the disclosure risk much more than others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several text anonymization methods under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
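A minimal sketch of the attack's shape, with a TF-IDF bag-of-words classifier standing in for the neural language models the paper uses, and a hypothetical toy corpus: the attacker trains on public texts with known identities and then scores anonymized documents by re-identification confidence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Attacker's background knowledge: public texts labeled with identities.
background = ["alice writes about hiking and her dog",
              "bob posts about chess openings",
              "alice mentions her dog rex again",
              "bob analyses a chess endgame"]
identities = ["alice", "bob", "alice", "bob"]

attack = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
attack.fit(background, identities)

# Disclosure risk of an anonymized document = confidence of re-identification.
anonymized = ["*** writes about walking *** dog"]
print(attack.predict(anonymized), attack.predict_proba(anonymized))
```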

Citations: 0
Efficient learning with projected histograms
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-01 | DOI: 10.1007/s10618-024-01063-6
Zhanliang Huang, Ata Kabán, Henry Reeve

High dimensional learning is a perennial problem due to challenges posed by the “curse of dimensionality”; learning typically demands more computing resources as well as more training data. In differentially private (DP) settings, this is further exacerbated by the noise that must be added to each dimension to achieve the required privacy. In this paper, we present a surprisingly simple approach that addresses all of these concerns at once, based on histograms constructed on a low-dimensional random projection (RP) of the data. Our approach exploits RP to take advantage of hidden low-dimensional structures in the data, yielding both computational efficiency and improved error convergence with respect to the sample size, whereby less training data suffices for learning. We also propose a variant for efficient differentially private (DP) classification that further exploits the data-oblivious nature of both the histogram construction and the RP-based dimensionality reduction, resulting in efficient management of the privacy budget. We present a detailed and rigorous theoretical analysis of the generalisation of our algorithms in several settings, showing that our approach is able to exploit low-dimensional structure in the data, ameliorates the ill effects of the noise required for privacy, and generalises well under minimal conditions. We also corroborate our findings experimentally, and demonstrate that our algorithms achieve competitive classification accuracy in both non-private and private settings.
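A minimal non-private sketch of the core construction, assuming a Gaussian random projection followed by an equal-width histogram with majority-vote prediction per cell; the DP variant would additionally perturb the cell statistics, which is omitted here, and all names are illustrative.

```python
import numpy as np

class RPHistogramClassifier:
    """Toy classifier: project to d dimensions with a random Gaussian
    matrix, bin each projected dimension, and predict the majority label
    of the training points that fall in the same histogram cell."""

    def __init__(self, d=2, bins=8, seed=0):
        self.d, self.bins = d, bins
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        p = X.shape[1]
        self.R = self.rng.normal(size=(p, self.d)) / np.sqrt(self.d)
        Z = X @ self.R
        self.edges = [np.linspace(Z[:, j].min(), Z[:, j].max(), self.bins + 1)
                      for j in range(self.d)]
        self.cells = {}
        for z, label in zip(Z, y):
            self.cells.setdefault(self._cell(z), []).append(label)
        self.default = max(set(y), key=list(y).count)  # global majority label
        return self

    def _cell(self, z):
        return tuple(int(np.clip(np.searchsorted(e, zj) - 1, 0, self.bins - 1))
                     for zj, e in zip(z, self.edges))

    def predict(self, X):
        out = []
        for z in X @ self.R:
            votes = self.cells.get(self._cell(z), [])
            out.append(max(set(votes), key=votes.count) if votes else self.default)
        return np.array(out)

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 50))
y = list((X[:, 0] + X[:, 1] > 0).astype(int))   # hidden low-dimensional structure
clf = RPHistogramClassifier(d=2, bins=6).fit(X, y)
print((clf.predict(X) == np.array(y)).mean())
```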

Citations: 0
Opinion dynamics in social networks incorporating higher-order interactions
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-30 | DOI: 10.1007/s10618-024-01064-5
Zuobai Zhang, Wanyue Xu, Zhongzhi Zhang, Guanrong Chen

The issue of opinion sharing and formation has received considerable attention in the academic literature, and a few models have been proposed to study this problem. However, existing models are limited to interactions among nearest neighbors, with second, third, and higher-order neighbors considered only indirectly, despite the fact that higher-order interactions occur frequently in real social networks. In this paper, we develop a new model for opinion dynamics that incorporates long-range interactions based on higher-order random walks and can explicitly tune the degree of influence of higher-order neighbor interactions. We prove that the model converges to a fixed opinion vector, which may differ greatly from that of models without higher-order interactions. Since direct computation of the equilibrium opinion is expensive, involving huge-scale matrix multiplication and inversion, we design an estimation algorithm with theoretical convergence guarantees that approximates the equilibrium opinion vector in time and space nearly linear in the number of edges of the graph. We conduct extensive experiments on various social networks, demonstrating that the new algorithm is both highly efficient and effective.
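A rough sketch of the flavor of such a model, assuming a Friedkin-Johnsen-style update in which the interaction matrix mixes the one-step random-walk matrix with its two-step power; the mixing weight, susceptibility parameter, and update rule are illustrative assumptions rather than the paper's exact model, and the paper's nearly-linear-time estimator avoids dense products like the one formed below.

```python
import numpy as np

def higher_order_opinions(A, s, alpha=0.5, gamma=0.5, iters=200):
    """Iterate x <- (1-gamma)*s + gamma*(W @ x), where W mixes the one-step
    random-walk matrix P with the two-step matrix P @ P (toy higher-order
    interactions); s holds the innate opinions. Converges because W is
    row-stochastic and gamma < 1."""
    P = A / A.sum(axis=1, keepdims=True)
    W = alpha * P + (1 - alpha) * (P @ P)
    x = s.copy()
    for _ in range(iters):
        x = (1 - gamma) * s + gamma * (W @ x)
    return x

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # small undirected graph
s = np.array([1.0, 0.8, 0.2, 0.0])          # innate opinions
print(higher_order_opinions(A, s))
```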

Citations: 0
Model-agnostic variable importance for predictive uncertainty: an entropy-based approach
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-29 | DOI: 10.1007/s10618-024-01070-7
Danny Wood, Theodore Papamarkou, Matt Benatan, Richard Allmendinger

In order to trust the predictions of a machine learning algorithm, it is necessary to understand the factors that contribute to those predictions. In the case of probabilistic and uncertainty-aware models, it is necessary to understand not only the reasons for the predictions themselves, but also the reasons for the model’s level of confidence in those predictions. In this paper, we show how existing methods in explainability can be extended to uncertainty-aware models and how such extensions can be used to understand the sources of uncertainty in a model’s predictive distribution. In particular, by adapting permutation feature importance, partial dependence plots, and individual conditional expectation plots, we demonstrate that novel insights into model behaviour may be obtained and that these methods can be used to measure the impact of features on both the entropy of the predictive distribution and the log-likelihood of the ground truth labels under that distribution. With experiments using both synthetic and real-world data, we demonstrate the utility of these approaches to understand both the sources of uncertainty and their impact on model performance.
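As a concrete instance of one of these adaptations, the sketch below implements a permutation-importance analogue for uncertainty: it permutes one feature at a time and records the change in the mean entropy of the model's predictive distribution. The dataset and model are stand-ins for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

def entropy(p):
    """Shannon entropy of each row of a predictive-probability matrix."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=1)

def permutation_entropy_importance(model, X, seed=0):
    """Change in mean predictive entropy when one feature is permuted:
    a permutation-importance analogue for uncertainty rather than accuracy."""
    rng = np.random.default_rng(seed)
    base = entropy(model.predict_proba(X)).mean()
    scores = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        scores.append(entropy(model.predict_proba(Xp)).mean() - base)
    return np.array(scores)

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)
print(permutation_entropy_importance(model, X))
```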

Citations: 0
Detach-ROCKET: sequential feature selection for time series classification with random convolutional kernels
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-20 | DOI: 10.1007/s10618-024-01062-7
Gonzalo Uribarri, Federico Barone, Alessio Ansuini, Erik Fransén

Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6% while reducing features by 98.9%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at https://github.com/gon-uri/detach_rocket.
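A condensed sketch of the detachment loop, assuming ridge-classifier coefficient magnitudes as the importance score and a fixed per-step pruning fraction; the authors' released implementation at https://github.com/gon-uri/detach_rocket is the reference.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

def sequential_feature_detachment(X, y, keep=0.10, drop_frac=0.2):
    """Iteratively retrain a ridge classifier and detach the fraction of
    remaining features with the smallest |coefficient| until only `keep`
    of the original features survive. Toy version of SFD."""
    active = np.arange(X.shape[1])
    while len(active) > keep * X.shape[1]:
        clf = RidgeClassifierCV().fit(X[:, active], y)
        importance = np.abs(clf.coef_).mean(axis=0)   # average over classes
        n_drop = max(1, int(drop_frac * len(active)))
        active = active[np.argsort(importance)[n_drop:]]  # keep the strongest
    return active

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))
y = (X[:, :5].sum(axis=1) > 0).astype(int)   # only 5 informative features
print(sorted(sequential_feature_detachment(X, y)))
```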

Citations: 0
Bayesian network Motifs for reasoning over heterogeneous unlinked datasets
IF 4.8 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-17 | DOI: 10.1007/s10618-024-01054-7
Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver

Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.
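A minimal sketch of the composition idea on two hypothetical unlinked tables, assuming conditional independence given the shared attribute: a strong assumption, and the paper's motifs exist precisely to handle the mismatches (differing sample biases, category definitions, and so on) that break naive compositions like this one.

```python
import pandas as pd

# Two unlinked toy surveys that share only the attribute "age_group".
d1 = pd.DataFrame({"age_group": ["young", "young", "old", "old", "old"],
                   "income":    ["low", "high", "high", "high", "low"]})
d2 = pd.DataFrame({"age_group": ["young", "old", "old", "young", "old"],
                   "neighborhood": ["north", "south", "south", "north", "north"]})

# P(income | age_group) from dataset 1, P(age_group | neighborhood) from dataset 2.
p_inc_age = pd.crosstab(d1.age_group, d1.income, normalize="index")
p_age_nbh = pd.crosstab(d2.neighborhood, d2.age_group, normalize="index")

# Compose: P(income | neighborhood) = sum_a P(income | a) * P(a | neighborhood),
# marginalizing over the shared attribute that links the two tables.
p_inc_nbh = p_age_nbh @ p_inc_age
print(p_inc_nbh)
```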

Citations: 0