Online concept evolution detection based on active learning
Pub Date: 2024-03-15 | DOI: 10.1007/s10618-024-01011-4
Husheng Guo, Hai Li, Lu Cong, Wenjian Wang
Concept evolution detection is an important and difficult problem in streaming data mining. When the labeled samples in a data stream are insufficient to reflect the training data distribution, detection performance is often further restricted. This paper proposes a concept evolution detection method based on active learning (CE_AL). Firstly, initial classifiers are constructed from a small number of labeled samples, and the sample space is divided into automatic labeling and active labeling areas according to the relationship between the classifiers of different categories. Secondly, for newly arriving online samples, two strategies are adopted depending on the area a sample falls into: automatic labeling by the model and expert labeling driven by active learning, which improves online learning performance with only a small number of labeled samples. Besides, a “data enhancement” strategy combined with “model enhancement” is adopted to accelerate the convergence of the evolution-category detection model. The experimental results show that the proposed CE_AL method can enhance concept evolution detection performance and realize efficient learning in an unstable environment by labeling only a small number of key samples.
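A minimal sketch of the general idea of routing incoming stream samples to either automatic model labeling or expert labeling, assuming a simple prediction-margin rule; CE_AL's actual area construction from the relationship between per-category classifiers is not reproduced here, and the classifier, threshold, and data below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Generic sketch: decide per stream sample whether the model labels it
# automatically or an expert is queried, based on the prediction margin.
# The margin threshold is an assumption, not the rule used in CE_AL.
rng = np.random.default_rng(0)
X_seed = rng.normal(size=(50, 5))
y_seed = (X_seed[:, 0] + X_seed[:, 1] > 0).astype(int)

# logistic loss so predict_proba is available (the loss is named "log" in older scikit-learn)
clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X_seed, y_seed, classes=np.array([0, 1]))  # initial classifier from few labels

MARGIN = 0.2  # assumed width of the "active labeling" area

def process(x):
    proba = clf.predict_proba(x.reshape(1, -1))[0]
    top2 = np.sort(proba)[-2:]
    if top2[1] - top2[0] >= MARGIN:            # confident: automatic (model) labeling area
        y_hat = int(np.argmax(proba))
        clf.partial_fit(x.reshape(1, -1), [y_hat])
        return "auto", y_hat
    return "expert", None                       # uncertain: active labeling area, query expert

for x in rng.normal(size=(5, 5)):
    print(process(x))
```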
{"title":"Online concept evolution detection based on active learning","authors":"Husheng Guo, Hai Li, Lu Cong, Wenjian Wang","doi":"10.1007/s10618-024-01011-4","DOIUrl":"https://doi.org/10.1007/s10618-024-01011-4","url":null,"abstract":"<p>Concept evolution detection is an important and difficult problem in streaming data mining. When the labeled samples in streaming data insufficient to reflect the training data distribution, it will often further restrict the detection performance. This paper proposed a concept evolution detection method based on active learning (CE_AL). Firstly, the initial classifiers are constructed by a small number of labeled samples. The sample areas are divided into the automatic labeling and the active labeling areas according to the relationship between the classifiers of different categories. Secondly, for online new coming samples, according to their different areas, two strategies based on the automatic learning-based model labeling and active learning-based expert labeling are adopted respectively, which can improve the online learning performance with only a small number of labeled samples. Besides, the strategy of “data enhance” combined with “model enhance” is adopted to accelerate the convergence of the evolution category detection model. The experimental results show that the proposed CE_AL method can enhance the detection performance of concept evolution and realize efficient learning in an unstable environment by labeling a small number of key samples.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"23 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marginal effects for non-linear prediction functions
Pub Date: 2024-02-27 | DOI: 10.1007/s10618-023-00993-x
Christian A. Scholbeck, Giuseppe Casalicchio, Christoph Molnar, Bernd Bischl, Christian Heumann
Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models such as generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either as derivatives of the prediction function or forward differences in prediction due to changes in feature values. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a general model-agnostic interpretation method for machine learning models. This may stem from the ambiguity surrounding marginal effects and their inability to deal with the non-linearities found in black box models. We introduce a unified definition of forward marginal effects (FMEs) that includes univariate and multivariate, as well as continuous, categorical, and mixed-type features. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for FMEs. Furthermore, we argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to average homogeneous FMEs within population subgroups, which serve as conditional feature effect estimates.
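As a rough illustration, a forward marginal effect for an observation is the forward difference f(x + h) − f(x) for a chosen step h in the feature of interest; the sketch below computes it for an arbitrary fitted model (the model, data, and step size are assumptions, and the paper's non-linearity measure and subgroup averaging are not shown).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative forward marginal effect (FME): change in prediction when a
# single feature is shifted by a step h, evaluated per observation.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=1).fit(X, y)

def forward_marginal_effect(model, X, feature, h):
    """FME_i = f(x_i with `feature` shifted by h) - f(x_i)."""
    X_shift = X.copy()
    X_shift[:, feature] += h
    return model.predict(X_shift) - model.predict(X)

fme = forward_marginal_effect(model, X, feature=0, h=0.5)
print("average FME:", fme.mean())   # a single summary can hide heterogeneity
print("FME spread :", fme.std())    # motivating subgroup-wise (conditional) averaging
```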
{"title":"Marginal effects for non-linear prediction functions","authors":"Christian A. Scholbeck, Giuseppe Casalicchio, Christoph Molnar, Bernd Bischl, Christian Heumann","doi":"10.1007/s10618-023-00993-x","DOIUrl":"https://doi.org/10.1007/s10618-023-00993-x","url":null,"abstract":"<p>Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models such as generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either as derivatives of the prediction function or forward differences in prediction due to changes in feature values. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a general model-agnostic interpretation method for machine learning models. This may stem from the ambiguity surrounding marginal effects and their inability to deal with the non-linearities found in black box models. We introduce a unified definition of forward marginal effects (FMEs) that includes univariate and multivariate, as well as continuous, categorical, and mixed-type features. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for FMEs. Furthermore, we argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to average homogeneous FMEs within population subgroups, which serve as conditional feature effect estimates.\u0000</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"476 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140010901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning a Bayesian network with multiple latent variables for implicit relation representation
Pub Date: 2024-02-22 | DOI: 10.1007/s10618-024-01012-3
Xinran Wu, Kun Yue, Liang Duan, Xiaodong Fu
Artificial intelligence applications could be more powerful and comprehensive by incorporating the ability to perform inference, which can be achieved through probabilistic inference over implicit relations. Representing implicit relations among observed variables and latent ones, such as disease etiologies and user preferences, is significant yet challenging. In this paper, we propose the Bayesian network with multiple latent variables (MLBN) as a framework for representing such dependence relations, where multiple latent variables are incorporated to describe multi-dimensional abstract concepts. However, efficient MLBN learning and effective MLBN-based applications are nontrivial to achieve due to the presence of multiple latent variables. To this end, we first propose a constraint-induced, Spark-based algorithm for MLBN learning, along with several optimization strategies. Moreover, we present the concept of variation degree and design a subgraph-based algorithm for incremental learning of the MLBN. Experimental results suggest that the proposed MLBN model represents dependence relations correctly. Our method outperforms several state-of-the-art competitors for personalized recommendation and helps some typical approaches achieve better performance.
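A toy sketch of the representational idea behind latent variables, assuming a made-up generative model: two observed variables look dependent marginally but become (nearly) independent once the latent variable is known, which is the kind of implicit relation a latent-variable BN encodes. This is not the MLBN learning algorithm itself.

```python
import numpy as np

# Toy generative model: a latent variable Z (e.g., a user preference) drives
# two observed behaviours X and Y. Marginally X and Y appear correlated;
# conditioning on Z removes the dependence.
rng = np.random.default_rng(2)
n = 20000
z = rng.integers(0, 2, size=n)                    # latent variable
x = rng.normal(loc=2.0 * z, scale=1.0, size=n)    # observed variable 1
y = rng.normal(loc=2.0 * z, scale=1.0, size=n)    # observed variable 2

print("corr(X, Y) overall     :", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(X, Y) given Z = 0 :", round(np.corrcoef(x[z == 0], y[z == 0])[0, 1], 3))
print("corr(X, Y) given Z = 1 :", round(np.corrcoef(x[z == 1], y[z == 1])[0, 1], 3))
```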
{"title":"Learning a Bayesian network with multiple latent variables for implicit relation representation","authors":"Xinran Wu, Kun Yue, Liang Duan, Xiaodong Fu","doi":"10.1007/s10618-024-01012-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01012-3","url":null,"abstract":"<p>Artificial intelligence applications could be more powerful and comprehensive by incorporating the ability of inference, which could be achieved by probabilistic inference over implicit relations. It is significant yet challenging to represent implicit relations among observed variables and latent ones like disease etiologies and user preferences. In this paper, we propose the BN with multiple latent variables (MLBN) as the framework for representing the dependence relations, where multiple latent variables are incorporated to describe multi-dimensional abstract concepts. However, the efficiency of MLBN learning and effectiveness of MLBN based applications are still nontrivial due to the presence of multiple latent variables. To this end, we first propose the constraint induced and Spark based algorithm for MLBN learning, as well as several optimization strategies. Moreover, we present the concept of variation degree and further design a subgraph based algorithm for incremental learning of MLBN. Experimental results suggest that our proposed MLBN model could represent the dependence relations correctly. Our proposed method outperforms some state-of-the-art competitors for personalized recommendation, and facilitates some typical approaches to achieve better performance.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"94 24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMA: metadata supported multi-variate attention for onset detection and prediction
Pub Date: 2024-02-19 | DOI: 10.1007/s10618-024-01008-z
Manjusha Ravindranath, K. Selçuk Candan, Maria Luisa Sapino, Brian Appavu
Deep learning has been applied successfully to sequence understanding and translation problems, especially in univariate, unimodal contexts where large amounts of supervision data are available. The effectiveness of deep learning in more complex (multi-modal, multi-variate) contexts, where supervision data are scarce, is generally less satisfactory. In this paper, we focus on improving detection and prediction accuracy in precisely such contexts – in particular, on the problem of predicting seizure onsets from multi-modal (EEG, ICP, ECG, and ABP) sensory data streams, some of which (such as EEG) are inherently multi-variate because multiple sensors are placed to capture the spatial distribution of the relevant signals. We note that multi-variate time series often carry robust, spatio-temporally localized features that can help predict onset events, and we argue that such features can be used to implement metadata supported multivariate attention (MMA) mechanisms that significantly improve the effectiveness of neural network architectures. We use the proposed MMA approach to develop a multi-modal LSTM-based neural network architecture that tackles seizure onset detection and prediction tasks relying on EEG, ICP, ECG, and ABP data streams. We experimentally evaluate the proposed architecture under different scenarios – the results illustrate the effectiveness of the proposed attention mechanism, especially compared against other metadata-driven competitors.
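For orientation only, a generic scaled dot-product attention over sensor channels is sketched below; the weights, dimensions, and data are assumptions, and the paper's metadata-supported MMA mechanism is more specific than this.

```python
import numpy as np

# Generic scaled dot-product attention across sensor channels of one
# multivariate window (channels x encoding dims). Illustrates attention
# weighting only; it is not the paper's MMA formulation.
def channel_attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over channels
    return weights @ V, weights

rng = np.random.default_rng(3)
H = rng.normal(size=(8, 16))    # 8 sensor channels, 16-dim encoding per channel
Wq, Wk = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
Wv = rng.normal(size=(16, 16))
out, w = channel_attention(H, Wq, Wk, Wv)
print(out.shape, w.shape)       # (8, 16) attended encodings, (8, 8) channel weights
```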
{"title":"MMA: metadata supported multi-variate attention for onset detection and prediction","authors":"Manjusha Ravindranath, K. Selçuk Candan, Maria Luisa Sapino, Brian Appavu","doi":"10.1007/s10618-024-01008-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01008-z","url":null,"abstract":"<p>Deep learning has been applied successfully in sequence understanding and translation problems, especially in univariate, unimodal contexts, where large number of supervision data are available. The effectiveness of deep learning in more complex (multi-modal, multi-variate) contexts, where supervision data is rare, however, is generally not satisfactory. In this paper, we focus on improving detection and prediction accuracy in precisely such contexts – in particular, we focus on the problem of predicting seizure onsets relying on multi-modal (EEG, ICP, ECG, and ABP) sensory data streams, some of which (such as EEG) are inherently multi-variate due to the placement of multiple sensors to capture spatial distribution of the relevant signals. In particular, we note that multi-variate time series often carry robust, spatio-temporally localized features that could help predict onset events. We further argue that such features can be used to support implementation of metadata supported multivariate attention (or MMA) mechanisms that help significantly improve the effectiveness of neural networks architectures. In this paper, we use the proposed MMA approach to develop a multi-modal LSTM-based neural network architecture to tackle seizure onset detection and prediction tasks relying on EEG, ICP, ECG, and ABP data streams. We experimentally evaluate the proposed architecture under different scenarios – the results illustrate the effectiveness of the proposed attention mechanism, especially compared against other metadata driven competitors.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"7 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139909369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structural learning of simple staged trees
Pub Date: 2024-02-15 | DOI: 10.1007/s10618-024-01007-0
Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here, we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.
{"title":"Structural learning of simple staged trees","authors":"","doi":"10.1007/s10618-024-01007-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01007-0","url":null,"abstract":"<h3>Abstract</h3> <p>Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here, we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"256 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Universal representation learning for multivariate time series using the instance-level and cluster-level supervised contrastive learning
Pub Date: 2024-02-09 | DOI: 10.1007/s10618-024-01006-1
Nazanin Moradinasab, Suchetha Sharma, Ronen Bar-Yoseph, Shlomit Radom-Aizik, Kenneth C. Bilchick, Dan M. Cooper, Arthur Weltman, Donald E. Brown
The multivariate time series classification (MTSC) task aims to predict a class label for a given time series. Recently, modern deep learning-based approaches have achieved promising performance over traditional methods for MTSC tasks. The success of these approaches relies on access to a massive amount of labeled data (i.e., annotating or assigning tags to each sample to indicate its category). However, obtaining a massive amount of labeled data is usually very time-consuming and expensive in many real-world applications such as medicine, because it requires domain experts’ knowledge to annotate data. Insufficient labeled data prevents these models from learning discriminative features, resulting in poor margins that reduce generalization performance. To address this challenge, we propose a novel approach: supervised contrastive learning for time series classification (SupCon-TSC). This approach improves classification performance by learning discriminative low-dimensional representations of multivariate time series, and its end-to-end structure allows for interpretable outcomes. It is based on the supervised contrastive (SupCon) loss, which learns the inherent structure of multivariate time series. First, two separate augmentation families, comprising strong and weak augmentation methods, are used to generate augmented data for the source and target networks, respectively. Second, we propose instance-level and cluster-level SupCon learning approaches that capture contextual information to learn discriminative and universal representations for multivariate time series datasets. In the instance-level SupCon approach, for each anchor instance coming from the source network, the low-variance output encodings from the target network are sampled as positive and negative instances based on their labels. The cluster-level approach, in contrast, operates between each instance and the cluster centers across batches; its loss attempts to maximize the similarities between each instance and the cluster centers. We tested this approach on two small cardiopulmonary exercise testing (CPET) datasets and the real-world UEA multivariate time series archive. The results of the SupCon-TSC model on the CPET datasets indicate its capability to learn more discriminative features than existing approaches when the dataset is small. Moreover, the results on the UEA archive show that training a classifier on top of the universal representation features learned by our proposed method outperforms the state-of-the-art approaches.
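A compact sketch of the instance-level supervised contrastive (SupCon) loss in the style of Khosla et al., assuming L2-normalized embeddings and a made-up batch; the source/target networks, augmentation families, and cluster-level variant described above are not reproduced.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Instance-level supervised contrastive loss (Khosla et al. style).
    z: (N, d) embeddings, labels: (N,) class ids."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalize
    sim = z @ z.T / tau                                    # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                         # exclude self-contrast
    log_den = np.log(np.exp(sim).sum(axis=1))              # log of denominator per anchor
    loss, counted = 0.0, 0
    for i in range(len(z)):
        pos = np.where((labels == labels[i]) & (np.arange(len(z)) != i))[0]
        if len(pos) == 0:
            continue
        loss += -np.mean(sim[i, pos] - log_den[i])         # -1/|P(i)| * sum_p log softmax(i, p)
        counted += 1
    return loss / max(counted, 1)

rng = np.random.default_rng(4)
z = rng.normal(size=(16, 32))
labels = rng.integers(0, 3, size=16)
print(supcon_loss(z, labels))
```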
{"title":"Universal representation learning for multivariate time series using the instance-level and cluster-level supervised contrastive learning","authors":"Nazanin Moradinasab, Suchetha Sharma, Ronen Bar-Yoseph, Shlomit Radom-Aizik, Kenneth C. Bilchick, Dan M. Cooper, Arthur Weltman, Donald E. Brown","doi":"10.1007/s10618-024-01006-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01006-1","url":null,"abstract":"<p>The multivariate time series classification (MTSC) task aims to predict a class label for a given time series. Recently, modern deep learning-based approaches have achieved promising performance over traditional methods for MTSC tasks. The success of these approaches relies on access to the massive amount of labeled data (i.e., annotating or assigning tags to each sample that shows its corresponding category). However, obtaining a massive amount of labeled data is usually very time-consuming and expensive in many real-world applications such as medicine, because it requires domain experts’ knowledge to annotate data. Insufficient labeled data prevents these models from learning discriminative features, resulting in poor margins that reduce generalization performance. To address this challenge, we propose a novel approach: supervised contrastive learning for time series classification (SupCon-TSC). This approach improves the classification performance by learning the discriminative low-dimensional representations of multivariate time series, and its end-to-end structure allows for interpretable outcomes. It is based on supervised contrastive (SupCon) loss to learn the inherent structure of multivariate time series. First, two separate augmentation families, including strong and weak augmentation methods, are utilized to generate augmented data for the source and target networks, respectively. Second, we propose the instance-level, and cluster-level SupCon learning approaches to capture contextual information to learn the discriminative and universal representation for multivariate time series datasets. In the instance-level SupCon learning approach, for each given anchor instance that comes from the source network, the low-variance output encodings from the target network are sampled as positive and negative instances based on their labels. However, the cluster-level approach is performed between each instance and cluster centers among batches, as opposed to the instance-level approach. The cluster-level SupCon loss attempts to maximize the similarities between each instance and cluster centers among batches. We tested this novel approach on two small cardiopulmonary exercise testing (CPET) datasets and the real-world UEA Multivariate time series archive. The results of the SupCon-TSC model on CPET datasets indicate its capability to learn more discriminative features than existing approaches in situations where the size of the dataset is small. 
Moreover, the results on the UEA archive show that training a classifier on top of the universal representation features learned by our proposed method outperforms the state-of-the-art approaches.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"85 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revealing the structural behaviour of Brunelleschi’s Dome with machine learning techniques
Pub Date: 2024-02-06 | DOI: 10.1007/s10618-024-01004-3
Stefano Masini, Silvia Bacci, Fabrizio Cipollini, Bruno Bertaccini
Brunelleschi’s Dome is one of the most iconic symbols of the Renaissance and is among the largest masonry domes ever constructed. The first masonry cracks appeared on the Dome in the late 17th century, prompting the start of monitoring activity. In modern times, a monitoring system comprising 166 electronic sensors, including deformometers and thermometers, has been in operation since 1988, providing a valuable source of real-time data on the monument’s health status. With the deformometers taking measurements at least four times per day, a vast amount of data is now available to explore the potential of the latest Artificial Intelligence and Machine Learning techniques in the field of historical-architectural heritage conservation. The objective of this contribution is twofold. Firstly, for the first time, we aim to unveil the overall structural behaviour of the Dome as a whole, as well as that of its specific sections (known as webs), by evaluating the effectiveness of certain dimensionality reduction techniques on the extensive daily measurements generated by the monitoring system, while also accounting for fluctuations in temperature over time. Secondly, we estimate a number of recurrent and convolutional neural network models to verify their capability for medium- and long-term prediction of the structural evolution of the Dome. We believe this contribution is an important step forward in the protection and preservation of historical buildings, showing the utility of machine learning in a context in which it is still little used.
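As a hedged illustration of the dimensionality-reduction step, the sketch below applies PCA to a synthetic days-by-sensors matrix with a seasonal (temperature-like) component; the data and preprocessing are assumptions, not the Dome monitoring pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the monitoring data: daily readings of many
# deformometers (rows = days, columns = sensors), dominated by a shared
# seasonal component. PCA summarises such data in a few global modes.
rng = np.random.default_rng(5)
days, sensors = 3650, 166
seasonal = np.sin(np.linspace(0, 20 * np.pi, days))[:, None]   # temperature-like cycle
readings = 0.8 * seasonal + 0.05 * rng.normal(size=(days, sensors))

X = StandardScaler().fit_transform(readings)
pca = PCA(n_components=3).fit(X)
print("variance explained by first 3 components:", pca.explained_variance_ratio_.round(3))
```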
{"title":"Revealing the structural behaviour of Brunelleschi’s Dome with machine learning techniques","authors":"Stefano Masini, Silvia Bacci, Fabrizio Cipollini, Bruno Bertaccini","doi":"10.1007/s10618-024-01004-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01004-3","url":null,"abstract":"<p>The Brunelleschi’s Dome is one of the most iconic symbols of the Renaissance and is among the largest masonry domes ever constructed. Since the late 17th century, first masonry cracks appeared on the Dome, giving the start to a monitoring activity. In modern times, since 1988 a monitoring system comprised of 166 electronic sensors, including deformometers and thermometers, has been in operation, providing a valuable source of real-time data on the monument’s health status. With the deformometers taking measurements at least four times per day, a vast amount of data is now available to explore the potential of the latest Artificial Intelligence and Machine Learning techniques in the field of historical-architectural heritage conservation. The objective of this contribution is twofold. Firstly, for the first time ever, we aim to unveil the overall structural behaviour of the Dome as a whole, as well as that of its specific sections (known as webs). We achieve this by evaluating the effectiveness of certain dimensionality reduction techniques on the extensive daily detections generated by the monitoring system, while also accounting for fluctuations in temperature over time. Secondly, we estimate a number of recurrent and convolutional neural network models to verify their capability for medium- and long-term prediction of the structural evolution of the Dome. We believe this contribution is an important step forward in the protection and preservation of historical buildings, showing the utility of machine learning in a context in which these are still little used.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VEM^2L: an easy but effective framework for fusing text and structure knowledge on sparse knowledge graph completion
Pub Date: 2024-02-06 | DOI: 10.1007/s10618-023-01001-y
Tao He, Ming Liu, Yixin Cao, Meng Qu, Zihao Zheng, Bing Qin
The task of Knowledge Graph Completion (KGC) is to infer missing links for Knowledge Graphs (KGs) by analyzing graph structures. However, with increasing sparsity in KGs, this task becomes increasingly challenging. In this paper, we propose VEM^2L, a joint learning framework that incorporates structure and relevant text information to supplement insufficient features for sparse KGs. We begin by training two pre-existing KGC models: one based on structure and the other based on text. Our ultimate goal is to fuse knowledge acquired by these models. To achieve this, we divide knowledge within the models into two non-overlapping parts: expressive power and generalization ability. We then propose two different joint learning methods that co-distill these two kinds of knowledge respectively. For expressive power, we allow each model to learn from and exchange knowledge mutually on training examples. For the generalization ability, we propose a novel co-distillation strategy using the Variational EM algorithm on unobserved queries. Our proposed joint learning framework is supported by both detailed theoretical evidence and qualitative experiments, demonstrating its effectiveness.
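A minimal sketch of mutual distillation on a single training query, assuming softmax score distributions from a structure-based and a text-based model; the symmetric KL term illustrates only the "learn from and exchange knowledge mutually" step, not VEM^2L's full co-distillation or its Variational EM strategy.

```python
import numpy as np

# Generic mutual distillation on one query: each model's predicted
# distribution over candidate entities is pulled toward the other's via a
# symmetric KL term, added to each model's own KGC loss.
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

scores_struct = np.array([2.0, 0.5, -1.0, 0.1])   # structure-based model, one query
scores_text   = np.array([1.2, 1.1, -0.5, 0.0])   # text-based model, same query

p, q = softmax(scores_struct), softmax(scores_text)
mutual_loss = kl(p, q) + kl(q, p)
print("symmetric KL between the two models:", round(mutual_loss, 4))
```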
{"title":"VEM $$^2$$ L: an easy but effective framework for fusing text and structure knowledge on sparse knowledge graph completion","authors":"Tao He, Ming Liu, Yixin Cao, Meng Qu, Zihao Zheng, Bing Qin","doi":"10.1007/s10618-023-01001-y","DOIUrl":"https://doi.org/10.1007/s10618-023-01001-y","url":null,"abstract":"<p>The task of Knowledge Graph Completion (KGC) is to infer missing links for Knowledge Graphs (KGs) by analyzing graph structures. However, with increasing sparsity in KGs, this task becomes increasingly challenging. In this paper, we propose VEM<span>(^2)</span>L, a joint learning framework that incorporates structure and relevant text information to supplement insufficient features for sparse KGs. We begin by training two pre-existing KGC models: one based on structure and the other based on text. Our ultimate goal is to fuse knowledge acquired by these models. To achieve this, we divide knowledge within the models into two non-overlapping parts: <b>expressive power</b> and <b>generalization ability</b>. We then propose two different joint learning methods that co-distill these two kinds of knowledge respectively. For expressive power, we allow each model to learn from and exchange knowledge mutually on training examples. For the generalization ability, we propose a novel co-distillation strategy using the Variational EM algorithm on unobserved queries. Our proposed joint learning framework is supported by both detailed theoretical evidence and qualitative experiments, demonstrating its effectiveness.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"18 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MASS: distance profile of a query over a time series
Pub Date: 2024-02-05 | DOI: 10.1007/s10618-024-01005-2
Sheng Zhong, Abdullah Mueen
Given a long time series, the distance profile of a query time series contains the distances between the query and every possible subsequence of the long series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm to efficiently compute the distance profile under the z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works; however, complete documentation of its increasingly efficient versions does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves the performance of existing data mining algorithms, and finally show the utility of MASS in domains including seismology, robotics, and power grids.
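A compact numpy/scipy sketch in the spirit of MASS, assuming the standard z-normalized distance identity and FFT-based sliding dot products; it is deliberately simpler than the four versions documented in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def distance_profile(query, ts):
    """z-normalized Euclidean distance between `query` and every subsequence
    of `ts`, via FFT-based sliding dot products (MASS-style)."""
    m = len(query)
    q = (query - query.mean()) / query.std()          # z-normalize the query once

    # running mean and std of all length-m subsequences, via cumulative sums
    cs = np.cumsum(np.insert(ts, 0, 0.0))
    cs2 = np.cumsum(np.insert(ts ** 2, 0, 0.0))
    mu = (cs[m:] - cs[:-m]) / m
    sigma = np.sqrt(np.maximum((cs2[m:] - cs2[:-m]) / m - mu ** 2, 1e-12))

    # sliding dot products q . T[i:i+m] for all i, in O(n log n)
    qt = fftconvolve(ts, q[::-1], mode="valid")

    # since q has zero mean and unit variance: d_i^2 = 2 * (m - q.T_i / sigma_i)
    return np.sqrt(np.maximum(2.0 * (m - qt / sigma), 0.0))

rng = np.random.default_rng(6)
ts = rng.normal(size=1000)
ts[400:420] += np.sin(np.linspace(0, 3 * np.pi, 20))   # embed a pattern
dp = distance_profile(ts[400:420], ts)
print("best match starts at index", int(np.argmin(dp)))  # expect ~400
```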
{"title":"MASS: distance profile of a query over a time series","authors":"Sheng Zhong, Abdullah Mueen","doi":"10.1007/s10618-024-01005-2","DOIUrl":"https://doi.org/10.1007/s10618-024-01005-2","url":null,"abstract":"<p>Given a long time series, the distance profile of a query time series computes distances between the query and every possible subsequence of a long time series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm to efficiently compute distance profile under z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works. However, complete documentation of the increasingly efficient versions of the algorithm does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves performances of existing data mining algorithms, and finally, show utility of MASS in domains including seismology, robotics and power grids.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"142 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms
Pub Date: 2024-01-31 | DOI: 10.1007/s10618-024-01002-5
Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho
Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possible hyperparameter configurations and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive performance. However, efficiently exploring this vast space of configurations and dealing with the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameter values already provide a suitable configuration. Additionally, for many reasons, including model validation and compliance with new legislation, there is increasing interest in interpretable models, such as those created by decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning on the two most often used DT induction algorithms, CART and C4.5. DT induction algorithms offer high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate the relevance of the hyperparameters using 94 classification datasets from OpenML. The experimental results show that tuning with algorithm-specific hyperparameter profiles provides statistically significant improvements on most of the datasets for CART, but on only one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for both algorithms was Irace. Finally, we found that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.
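A minimal scikit-learn sketch of randomized hyperparameter tuning for a CART-style decision tree on a single dataset; the search space, budget, and dataset are assumptions, and the paper's own study uses different tuning techniques (including Irace) over 94 OpenML datasets.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare a default decision tree with a tree tuned by random search over a
# few common CART hyperparameters (illustrative search space and budget).
X, y = load_breast_cancer(return_X_y=True)

default_score = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

param_space = {
    "max_depth": randint(2, 30),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 50),
    "ccp_alpha": uniform(0.0, 0.02),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_space, n_iter=50, cv=5, random_state=0,
)
search.fit(X, y)

print(f"default CV accuracy: {default_score:.3f}")
print(f"tuned   CV accuracy: {search.best_score_:.3f}")
print("best hyperparameters:", search.best_params_)
```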
{"title":"Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms","authors":"Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho","doi":"10.1007/s10618-024-01002-5","DOIUrl":"https://doi.org/10.1007/s10618-024-01002-5","url":null,"abstract":"<p>Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive performance. However, insights into efficiently exploring this vast space of configurations and dealing with the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameters fit the suitable configuration. Additionally, for many reasons, including model validation and attendance to new legislation, there is an increasing interest in interpretable models, such as those created by the decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning for the two DT induction algorithms most often used, CART and C4.5. DT induction algorithms present high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate hyperparameters’ relevance using 94 classification datasets from OpenML. The experimental results point out that different hyperparameter profiles for the tuning of each algorithm provide statistically significant improvements in most of the datasets for CART, but only in one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for all the algorithms was the Irace. Finally, we found out that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"13 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139658401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}