Online concept evolution detection based on active learning
Pub Date: 2024-03-15 | DOI: 10.1007/s10618-024-01011-4
Husheng Guo, Hai Li, Lu Cong, Wenjian Wang
Concept evolution detection is an important and difficult problem in streaming data mining. When the labeled samples in a data stream are insufficient to reflect the training data distribution, detection performance is often further restricted. This paper proposes a concept evolution detection method based on active learning (CE_AL). Firstly, initial classifiers are constructed from a small number of labeled samples, and the sample space is divided into automatic labeling and active labeling areas according to the relationship between the classifiers of different categories. Secondly, for newly arriving online samples, two strategies are adopted depending on the area a sample falls into: automatic labeling by the model and expert labeling driven by active learning, which improves online learning performance with only a small number of labeled samples. Besides, a “data enhancement” strategy combined with “model enhancement” is adopted to accelerate the convergence of the evolution-category detection model. The experimental results show that the proposed CE_AL method can enhance concept evolution detection performance and realize efficient learning in an unstable environment by labeling only a small number of key samples.
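A minimal sketch of the general idea of routing incoming stream samples to either automatic model labeling or expert labeling, assuming a simple prediction-margin rule; CE_AL's actual area construction from the relationship between per-category classifiers is not reproduced here, and the classifier, threshold, and data below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Generic sketch: decide per stream sample whether the model labels it
# automatically or an expert is queried, based on the prediction margin.
# The margin threshold is an assumption, not the rule used in CE_AL.
rng = np.random.default_rng(0)
X_seed = rng.normal(size=(50, 5))
y_seed = (X_seed[:, 0] + X_seed[:, 1] > 0).astype(int)

# logistic loss so predict_proba is available (the loss is named "log" in older scikit-learn)
clf = SGDClassifier(loss="log_loss", random_state=0)
clf.partial_fit(X_seed, y_seed, classes=np.array([0, 1]))  # initial classifier from few labels

MARGIN = 0.2  # assumed width of the "active labeling" area

def process(x):
    proba = clf.predict_proba(x.reshape(1, -1))[0]
    top2 = np.sort(proba)[-2:]
    if top2[1] - top2[0] >= MARGIN:            # confident: automatic (model) labeling area
        y_hat = int(np.argmax(proba))
        clf.partial_fit(x.reshape(1, -1), [y_hat])
        return "auto", y_hat
    return "expert", None                       # uncertain: active labeling area, query expert

for x in rng.normal(size=(5, 5)):
    print(process(x))
```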
{"title":"Online concept evolution detection based on active learning","authors":"Husheng Guo, Hai Li, Lu Cong, Wenjian Wang","doi":"10.1007/s10618-024-01011-4","DOIUrl":"https://doi.org/10.1007/s10618-024-01011-4","url":null,"abstract":"<p>Concept evolution detection is an important and difficult problem in streaming data mining. When the labeled samples in streaming data insufficient to reflect the training data distribution, it will often further restrict the detection performance. This paper proposed a concept evolution detection method based on active learning (CE_AL). Firstly, the initial classifiers are constructed by a small number of labeled samples. The sample areas are divided into the automatic labeling and the active labeling areas according to the relationship between the classifiers of different categories. Secondly, for online new coming samples, according to their different areas, two strategies based on the automatic learning-based model labeling and active learning-based expert labeling are adopted respectively, which can improve the online learning performance with only a small number of labeled samples. Besides, the strategy of “data enhance” combined with “model enhance” is adopted to accelerate the convergence of the evolution category detection model. The experimental results show that the proposed CE_AL method can enhance the detection performance of concept evolution and realize efficient learning in an unstable environment by labeling a small number of key samples.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"23 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140147920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marginal effects for non-linear prediction functions
Pub Date: 2024-02-27 | DOI: 10.1007/s10618-023-00993-x
Christian A. Scholbeck, Giuseppe Casalicchio, Christoph Molnar, Bernd Bischl, Christian Heumann
Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models such as generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either as derivatives of the prediction function or forward differences in prediction due to changes in feature values. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a general model-agnostic interpretation method for machine learning models. This may stem from the ambiguity surrounding marginal effects and their inability to deal with the non-linearities found in black box models. We introduce a unified definition of forward marginal effects (FMEs) that includes univariate and multivariate, as well as continuous, categorical, and mixed-type features. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for FMEs. Furthermore, we argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to average homogeneous FMEs within population subgroups, which serve as conditional feature effect estimates.
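As a rough illustration, a forward marginal effect for an observation is the forward difference f(x + h) − f(x) for a chosen step h in the feature of interest; the sketch below computes it for an arbitrary fitted model (the model, data, and step size are assumptions, and the paper's non-linearity measure and subgroup averaging are not shown).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative forward marginal effect (FME): change in prediction when a
# single feature is shifted by a step h, evaluated per observation.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(500, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(random_state=1).fit(X, y)

def forward_marginal_effect(model, X, feature, h):
    """FME_i = f(x_i with `feature` shifted by h) - f(x_i)."""
    X_shift = X.copy()
    X_shift[:, feature] += h
    return model.predict(X_shift) - model.predict(X)

fme = forward_marginal_effect(model, X, feature=0, h=0.5)
print("average FME:", fme.mean())   # a single summary can hide heterogeneity
print("FME spread :", fme.std())    # motivating subgroup-wise (conditional) averaging
```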
{"title":"Marginal effects for non-linear prediction functions","authors":"Christian A. Scholbeck, Giuseppe Casalicchio, Christoph Molnar, Bernd Bischl, Christian Heumann","doi":"10.1007/s10618-023-00993-x","DOIUrl":"https://doi.org/10.1007/s10618-023-00993-x","url":null,"abstract":"<p>Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models such as generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either as derivatives of the prediction function or forward differences in prediction due to changes in feature values. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a general model-agnostic interpretation method for machine learning models. This may stem from the ambiguity surrounding marginal effects and their inability to deal with the non-linearities found in black box models. We introduce a unified definition of forward marginal effects (FMEs) that includes univariate and multivariate, as well as continuous, categorical, and mixed-type features. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for FMEs. Furthermore, we argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to average homogeneous FMEs within population subgroups, which serve as conditional feature effect estimates.\u0000</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"476 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140010901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning a Bayesian network with multiple latent variables for implicit relation representation
Pub Date: 2024-02-22 | DOI: 10.1007/s10618-024-01012-3
Xinran Wu, Kun Yue, Liang Duan, Xiaodong Fu
Artificial intelligence applications could be more powerful and comprehensive by incorporating the ability to perform inference, which can be achieved through probabilistic inference over implicit relations. Representing implicit relations among observed variables and latent ones, such as disease etiologies and user preferences, is significant yet challenging. In this paper, we propose the Bayesian network with multiple latent variables (MLBN) as a framework for representing such dependence relations, where multiple latent variables are incorporated to describe multi-dimensional abstract concepts. However, efficient MLBN learning and effective MLBN-based applications are nontrivial to achieve due to the presence of multiple latent variables. To this end, we first propose a constraint-induced, Spark-based algorithm for MLBN learning, along with several optimization strategies. Moreover, we present the concept of variation degree and design a subgraph-based algorithm for incremental learning of the MLBN. Experimental results suggest that the proposed MLBN model represents dependence relations correctly. Our method outperforms several state-of-the-art competitors for personalized recommendation and helps some typical approaches achieve better performance.
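A toy sketch of the representational idea behind latent variables, assuming a made-up generative model: two observed variables look dependent marginally but become (nearly) independent once the latent variable is known, which is the kind of implicit relation a latent-variable BN encodes. This is not the MLBN learning algorithm itself.

```python
import numpy as np

# Toy generative model: a latent variable Z (e.g., a user preference) drives
# two observed behaviours X and Y. Marginally X and Y appear correlated;
# conditioning on Z removes the dependence.
rng = np.random.default_rng(2)
n = 20000
z = rng.integers(0, 2, size=n)                    # latent variable
x = rng.normal(loc=2.0 * z, scale=1.0, size=n)    # observed variable 1
y = rng.normal(loc=2.0 * z, scale=1.0, size=n)    # observed variable 2

print("corr(X, Y) overall     :", round(np.corrcoef(x, y)[0, 1], 3))
print("corr(X, Y) given Z = 0 :", round(np.corrcoef(x[z == 0], y[z == 0])[0, 1], 3))
print("corr(X, Y) given Z = 1 :", round(np.corrcoef(x[z == 1], y[z == 1])[0, 1], 3))
```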
{"title":"Learning a Bayesian network with multiple latent variables for implicit relation representation","authors":"Xinran Wu, Kun Yue, Liang Duan, Xiaodong Fu","doi":"10.1007/s10618-024-01012-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01012-3","url":null,"abstract":"<p>Artificial intelligence applications could be more powerful and comprehensive by incorporating the ability of inference, which could be achieved by probabilistic inference over implicit relations. It is significant yet challenging to represent implicit relations among observed variables and latent ones like disease etiologies and user preferences. In this paper, we propose the BN with multiple latent variables (MLBN) as the framework for representing the dependence relations, where multiple latent variables are incorporated to describe multi-dimensional abstract concepts. However, the efficiency of MLBN learning and effectiveness of MLBN based applications are still nontrivial due to the presence of multiple latent variables. To this end, we first propose the constraint induced and Spark based algorithm for MLBN learning, as well as several optimization strategies. Moreover, we present the concept of variation degree and further design a subgraph based algorithm for incremental learning of MLBN. Experimental results suggest that our proposed MLBN model could represent the dependence relations correctly. Our proposed method outperforms some state-of-the-art competitors for personalized recommendation, and facilitates some typical approaches to achieve better performance.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"94 24 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139947420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MMA: metadata supported multi-variate attention for onset detection and prediction
Pub Date: 2024-02-19 | DOI: 10.1007/s10618-024-01008-z
Manjusha Ravindranath, K. Selçuk Candan, Maria Luisa Sapino, Brian Appavu
Deep learning has been applied successfully to sequence understanding and translation problems, especially in univariate, unimodal contexts where large amounts of supervision data are available. The effectiveness of deep learning in more complex (multi-modal, multi-variate) contexts, where supervision data are scarce, is generally less satisfactory. In this paper, we focus on improving detection and prediction accuracy in precisely such contexts – in particular, on the problem of predicting seizure onsets from multi-modal (EEG, ICP, ECG, and ABP) sensory data streams, some of which (such as EEG) are inherently multi-variate because multiple sensors are placed to capture the spatial distribution of the relevant signals. We note that multi-variate time series often carry robust, spatio-temporally localized features that can help predict onset events, and we argue that such features can be used to implement metadata supported multivariate attention (MMA) mechanisms that significantly improve the effectiveness of neural network architectures. We use the proposed MMA approach to develop a multi-modal LSTM-based neural network architecture that tackles seizure onset detection and prediction tasks relying on EEG, ICP, ECG, and ABP data streams. We experimentally evaluate the proposed architecture under different scenarios – the results illustrate the effectiveness of the proposed attention mechanism, especially compared against other metadata-driven competitors.
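For orientation only, a generic scaled dot-product attention over sensor channels is sketched below; the weights, dimensions, and data are assumptions, and the paper's metadata-supported MMA mechanism is more specific than this.

```python
import numpy as np

# Generic scaled dot-product attention across sensor channels of one
# multivariate window (channels x encoding dims). Illustrates attention
# weighting only; it is not the paper's MMA formulation.
def channel_attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over channels
    return weights @ V, weights

rng = np.random.default_rng(3)
H = rng.normal(size=(8, 16))    # 8 sensor channels, 16-dim encoding per channel
Wq, Wk = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
Wv = rng.normal(size=(16, 16))
out, w = channel_attention(H, Wq, Wk, Wv)
print(out.shape, w.shape)       # (8, 16) attended encodings, (8, 8) channel weights
```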
{"title":"MMA: metadata supported multi-variate attention for onset detection and prediction","authors":"Manjusha Ravindranath, K. Selçuk Candan, Maria Luisa Sapino, Brian Appavu","doi":"10.1007/s10618-024-01008-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01008-z","url":null,"abstract":"<p>Deep learning has been applied successfully in sequence understanding and translation problems, especially in univariate, unimodal contexts, where large number of supervision data are available. The effectiveness of deep learning in more complex (multi-modal, multi-variate) contexts, where supervision data is rare, however, is generally not satisfactory. In this paper, we focus on improving detection and prediction accuracy in precisely such contexts – in particular, we focus on the problem of predicting seizure onsets relying on multi-modal (EEG, ICP, ECG, and ABP) sensory data streams, some of which (such as EEG) are inherently multi-variate due to the placement of multiple sensors to capture spatial distribution of the relevant signals. In particular, we note that multi-variate time series often carry robust, spatio-temporally localized features that could help predict onset events. We further argue that such features can be used to support implementation of metadata supported multivariate attention (or MMA) mechanisms that help significantly improve the effectiveness of neural networks architectures. In this paper, we use the proposed MMA approach to develop a multi-modal LSTM-based neural network architecture to tackle seizure onset detection and prediction tasks relying on EEG, ICP, ECG, and ABP data streams. We experimentally evaluate the proposed architecture under different scenarios – the results illustrate the effectiveness of the proposed attention mechanism, especially compared against other metadata driven competitors.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"7 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139909369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structural learning of simple staged trees
Pub Date: 2024-02-15 | DOI: 10.1007/s10618-024-01007-0
Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here, we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.
{"title":"Structural learning of simple staged trees","authors":"","doi":"10.1007/s10618-024-01007-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01007-0","url":null,"abstract":"<h3>Abstract</h3> <p>Bayesian networks faithfully represent the symmetric conditional independences existing between the components of a random vector. Staged trees are an extension of Bayesian networks for categorical random vectors whose graph represents non-symmetric conditional independences via vertex coloring. However, since they are based on a tree representation of the sample space, the underlying graph becomes cluttered and difficult to visualize as the number of variables increases. Here, we introduce the first structural learning algorithms for the class of simple staged trees, entertaining a compact coalescence of the underlying tree from which non-symmetric independences can be easily read. We show that data-learned simple staged trees often outperform Bayesian networks in model fit and illustrate how the coalesced graph is used to identify non-symmetric conditional independences.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"256 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Universal representation learning for multivariate time series using the instance-level and cluster-level supervised contrastive learning
Pub Date: 2024-02-09 | DOI: 10.1007/s10618-024-01006-1
Nazanin Moradinasab, Suchetha Sharma, Ronen Bar-Yoseph, Shlomit Radom-Aizik, Kenneth C. Bilchick, Dan M. Cooper, Arthur Weltman, Donald E. Brown
The multivariate time series classification (MTSC) task aims to predict a class label for a given time series. Recently, modern deep learning-based approaches have achieved promising performance over traditional methods for MTSC tasks. The success of these approaches relies on access to a massive amount of labeled data (i.e., annotating or assigning tags to each sample to indicate its category). However, obtaining a massive amount of labeled data is usually very time-consuming and expensive in many real-world applications such as medicine, because it requires domain experts’ knowledge to annotate data. Insufficient labeled data prevents these models from learning discriminative features, resulting in poor margins that reduce generalization performance. To address this challenge, we propose a novel approach: supervised contrastive learning for time series classification (SupCon-TSC). This approach improves classification performance by learning discriminative low-dimensional representations of multivariate time series, and its end-to-end structure allows for interpretable outcomes. It is based on the supervised contrastive (SupCon) loss, which learns the inherent structure of multivariate time series. First, two separate augmentation families, comprising strong and weak augmentation methods, are used to generate augmented data for the source and target networks, respectively. Second, we propose instance-level and cluster-level SupCon learning approaches that capture contextual information to learn discriminative and universal representations for multivariate time series datasets. In the instance-level SupCon approach, for each anchor instance coming from the source network, the low-variance output encodings from the target network are sampled as positive and negative instances based on their labels. The cluster-level approach, in contrast, operates between each instance and the cluster centers across batches; its loss attempts to maximize the similarities between each instance and the cluster centers. We tested this approach on two small cardiopulmonary exercise testing (CPET) datasets and the real-world UEA multivariate time series archive. The results of the SupCon-TSC model on the CPET datasets indicate its capability to learn more discriminative features than existing approaches when the dataset is small. Moreover, the results on the UEA archive show that training a classifier on top of the universal representation features learned by our proposed method outperforms the state-of-the-art approaches.
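A compact sketch of the instance-level supervised contrastive (SupCon) loss in the style of Khosla et al., assuming L2-normalized embeddings and a made-up batch; the source/target networks, augmentation families, and cluster-level variant described above are not reproduced.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Instance-level supervised contrastive loss (Khosla et al. style).
    z: (N, d) embeddings, labels: (N,) class ids."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalize
    sim = z @ z.T / tau                                    # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                         # exclude self-contrast
    log_den = np.log(np.exp(sim).sum(axis=1))              # log of denominator per anchor
    loss, counted = 0.0, 0
    for i in range(len(z)):
        pos = np.where((labels == labels[i]) & (np.arange(len(z)) != i))[0]
        if len(pos) == 0:
            continue
        loss += -np.mean(sim[i, pos] - log_den[i])         # -1/|P(i)| * sum_p log softmax(i, p)
        counted += 1
    return loss / max(counted, 1)

rng = np.random.default_rng(4)
z = rng.normal(size=(16, 32))
labels = rng.integers(0, 3, size=16)
print(supcon_loss(z, labels))
```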
{"title":"Universal representation learning for multivariate time series using the instance-level and cluster-level supervised contrastive learning","authors":"Nazanin Moradinasab, Suchetha Sharma, Ronen Bar-Yoseph, Shlomit Radom-Aizik, Kenneth C. Bilchick, Dan M. Cooper, Arthur Weltman, Donald E. Brown","doi":"10.1007/s10618-024-01006-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01006-1","url":null,"abstract":"<p>The multivariate time series classification (MTSC) task aims to predict a class label for a given time series. Recently, modern deep learning-based approaches have achieved promising performance over traditional methods for MTSC tasks. The success of these approaches relies on access to the massive amount of labeled data (i.e., annotating or assigning tags to each sample that shows its corresponding category). However, obtaining a massive amount of labeled data is usually very time-consuming and expensive in many real-world applications such as medicine, because it requires domain experts’ knowledge to annotate data. Insufficient labeled data prevents these models from learning discriminative features, resulting in poor margins that reduce generalization performance. To address this challenge, we propose a novel approach: supervised contrastive learning for time series classification (SupCon-TSC). This approach improves the classification performance by learning the discriminative low-dimensional representations of multivariate time series, and its end-to-end structure allows for interpretable outcomes. It is based on supervised contrastive (SupCon) loss to learn the inherent structure of multivariate time series. First, two separate augmentation families, including strong and weak augmentation methods, are utilized to generate augmented data for the source and target networks, respectively. Second, we propose the instance-level, and cluster-level SupCon learning approaches to capture contextual information to learn the discriminative and universal representation for multivariate time series datasets. In the instance-level SupCon learning approach, for each given anchor instance that comes from the source network, the low-variance output encodings from the target network are sampled as positive and negative instances based on their labels. However, the cluster-level approach is performed between each instance and cluster centers among batches, as opposed to the instance-level approach. The cluster-level SupCon loss attempts to maximize the similarities between each instance and cluster centers among batches. We tested this novel approach on two small cardiopulmonary exercise testing (CPET) datasets and the real-world UEA Multivariate time series archive. The results of the SupCon-TSC model on CPET datasets indicate its capability to learn more discriminative features than existing approaches in situations where the size of the dataset is small. 
Moreover, the results on the UEA archive show that training a classifier on top of the universal representation features learned by our proposed method outperforms the state-of-the-art approaches.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"85 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revealing the structural behaviour of Brunelleschi’s Dome with machine learning techniques
Pub Date: 2024-02-06 | DOI: 10.1007/s10618-024-01004-3
Stefano Masini, Silvia Bacci, Fabrizio Cipollini, Bruno Bertaccini
Brunelleschi’s Dome is one of the most iconic symbols of the Renaissance and is among the largest masonry domes ever constructed. The first masonry cracks appeared on the Dome in the late 17th century, prompting the start of monitoring activity. In modern times, a monitoring system comprising 166 electronic sensors, including deformometers and thermometers, has been in operation since 1988, providing a valuable source of real-time data on the monument’s health status. With the deformometers taking measurements at least four times per day, a vast amount of data is now available to explore the potential of the latest Artificial Intelligence and Machine Learning techniques in the field of historical-architectural heritage conservation. The objective of this contribution is twofold. Firstly, for the first time, we aim to unveil the overall structural behaviour of the Dome as a whole, as well as that of its specific sections (known as webs), by evaluating the effectiveness of certain dimensionality reduction techniques on the extensive daily measurements generated by the monitoring system, while also accounting for fluctuations in temperature over time. Secondly, we estimate a number of recurrent and convolutional neural network models to verify their capability for medium- and long-term prediction of the structural evolution of the Dome. We believe this contribution is an important step forward in the protection and preservation of historical buildings, showing the utility of machine learning in a context in which it is still little used.
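As a hedged illustration of the dimensionality-reduction step, the sketch below applies PCA to a synthetic days-by-sensors matrix with a seasonal (temperature-like) component; the data and preprocessing are assumptions, not the Dome monitoring pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the monitoring data: daily readings of many
# deformometers (rows = days, columns = sensors), dominated by a shared
# seasonal component. PCA summarises such data in a few global modes.
rng = np.random.default_rng(5)
days, sensors = 3650, 166
seasonal = np.sin(np.linspace(0, 20 * np.pi, days))[:, None]   # temperature-like cycle
readings = 0.8 * seasonal + 0.05 * rng.normal(size=(days, sensors))

X = StandardScaler().fit_transform(readings)
pca = PCA(n_components=3).fit(X)
print("variance explained by first 3 components:", pca.explained_variance_ratio_.round(3))
```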
{"title":"Revealing the structural behaviour of Brunelleschi’s Dome with machine learning techniques","authors":"Stefano Masini, Silvia Bacci, Fabrizio Cipollini, Bruno Bertaccini","doi":"10.1007/s10618-024-01004-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01004-3","url":null,"abstract":"<p>The Brunelleschi’s Dome is one of the most iconic symbols of the Renaissance and is among the largest masonry domes ever constructed. Since the late 17th century, first masonry cracks appeared on the Dome, giving the start to a monitoring activity. In modern times, since 1988 a monitoring system comprised of 166 electronic sensors, including deformometers and thermometers, has been in operation, providing a valuable source of real-time data on the monument’s health status. With the deformometers taking measurements at least four times per day, a vast amount of data is now available to explore the potential of the latest Artificial Intelligence and Machine Learning techniques in the field of historical-architectural heritage conservation. The objective of this contribution is twofold. Firstly, for the first time ever, we aim to unveil the overall structural behaviour of the Dome as a whole, as well as that of its specific sections (known as webs). We achieve this by evaluating the effectiveness of certain dimensionality reduction techniques on the extensive daily detections generated by the monitoring system, while also accounting for fluctuations in temperature over time. Secondly, we estimate a number of recurrent and convolutional neural network models to verify their capability for medium- and long-term prediction of the structural evolution of the Dome. We believe this contribution is an important step forward in the protection and preservation of historical buildings, showing the utility of machine learning in a context in which these are still little used.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VEM^2L: an easy but effective framework for fusing text and structure knowledge on sparse knowledge graph completion
Pub Date: 2024-02-06 | DOI: 10.1007/s10618-023-01001-y
Tao He, Ming Liu, Yixin Cao, Meng Qu, Zihao Zheng, Bing Qin
The task of Knowledge Graph Completion (KGC) is to infer missing links for Knowledge Graphs (KGs) by analyzing graph structures. However, with increasing sparsity in KGs, this task becomes increasingly challenging. In this paper, we propose VEM^2L, a joint learning framework that incorporates structure and relevant text information to supplement insufficient features for sparse KGs. We begin by training two pre-existing KGC models: one based on structure and the other based on text. Our ultimate goal is to fuse knowledge acquired by these models. To achieve this, we divide knowledge within the models into two non-overlapping parts: expressive power and generalization ability. We then propose two different joint learning methods that co-distill these two kinds of knowledge respectively. For expressive power, we allow each model to learn from and exchange knowledge mutually on training examples. For the generalization ability, we propose a novel co-distillation strategy using the Variational EM algorithm on unobserved queries. Our proposed joint learning framework is supported by both detailed theoretical evidence and qualitative experiments, demonstrating its effectiveness.
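A minimal sketch of mutual distillation on a single training query, assuming softmax score distributions from a structure-based and a text-based model; the symmetric KL term illustrates only the "learn from and exchange knowledge mutually" step, not VEM^2L's full co-distillation or its Variational EM strategy.

```python
import numpy as np

# Generic mutual distillation on one query: each model's predicted
# distribution over candidate entities is pulled toward the other's via a
# symmetric KL term, added to each model's own KGC loss.
def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)))

scores_struct = np.array([2.0, 0.5, -1.0, 0.1])   # structure-based model, one query
scores_text   = np.array([1.2, 1.1, -0.5, 0.0])   # text-based model, same query

p, q = softmax(scores_struct), softmax(scores_text)
mutual_loss = kl(p, q) + kl(q, p)
print("symmetric KL between the two models:", round(mutual_loss, 4))
```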
{"title":"VEM $$^2$$ L: an easy but effective framework for fusing text and structure knowledge on sparse knowledge graph completion","authors":"Tao He, Ming Liu, Yixin Cao, Meng Qu, Zihao Zheng, Bing Qin","doi":"10.1007/s10618-023-01001-y","DOIUrl":"https://doi.org/10.1007/s10618-023-01001-y","url":null,"abstract":"<p>The task of Knowledge Graph Completion (KGC) is to infer missing links for Knowledge Graphs (KGs) by analyzing graph structures. However, with increasing sparsity in KGs, this task becomes increasingly challenging. In this paper, we propose VEM<span>(^2)</span>L, a joint learning framework that incorporates structure and relevant text information to supplement insufficient features for sparse KGs. We begin by training two pre-existing KGC models: one based on structure and the other based on text. Our ultimate goal is to fuse knowledge acquired by these models. To achieve this, we divide knowledge within the models into two non-overlapping parts: <b>expressive power</b> and <b>generalization ability</b>. We then propose two different joint learning methods that co-distill these two kinds of knowledge respectively. For expressive power, we allow each model to learn from and exchange knowledge mutually on training examples. For the generalization ability, we propose a novel co-distillation strategy using the Variational EM algorithm on unobserved queries. Our proposed joint learning framework is supported by both detailed theoretical evidence and qualitative experiments, demonstrating its effectiveness.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"18 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MASS: distance profile of a query over a time series
Pub Date: 2024-02-05 | DOI: 10.1007/s10618-024-01005-2
Sheng Zhong, Abdullah Mueen
Given a long time series, the distance profile of a query time series contains the distances between the query and every possible subsequence of the long series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm to efficiently compute the distance profile under the z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works; however, complete documentation of its increasingly efficient versions does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves the performance of existing data mining algorithms, and finally show the utility of MASS in domains including seismology, robotics, and power grids.
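A compact numpy/scipy sketch in the spirit of MASS, assuming the standard z-normalized distance identity and FFT-based sliding dot products; it is deliberately simpler than the four versions documented in the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def distance_profile(query, ts):
    """z-normalized Euclidean distance between `query` and every subsequence
    of `ts`, via FFT-based sliding dot products (MASS-style)."""
    m = len(query)
    q = (query - query.mean()) / query.std()          # z-normalize the query once

    # running mean and std of all length-m subsequences, via cumulative sums
    cs = np.cumsum(np.insert(ts, 0, 0.0))
    cs2 = np.cumsum(np.insert(ts ** 2, 0, 0.0))
    mu = (cs[m:] - cs[:-m]) / m
    sigma = np.sqrt(np.maximum((cs2[m:] - cs2[:-m]) / m - mu ** 2, 1e-12))

    # sliding dot products q . T[i:i+m] for all i, in O(n log n)
    qt = fftconvolve(ts, q[::-1], mode="valid")

    # since q has zero mean and unit variance: d_i^2 = 2 * (m - q.T_i / sigma_i)
    return np.sqrt(np.maximum(2.0 * (m - qt / sigma), 0.0))

rng = np.random.default_rng(6)
ts = rng.normal(size=1000)
ts[400:420] += np.sin(np.linspace(0, 3 * np.pi, 20))   # embed a pattern
dp = distance_profile(ts[400:420], ts)
print("best match starts at index", int(np.argmin(dp)))  # expect ~400
```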
{"title":"MASS: distance profile of a query over a time series","authors":"Sheng Zhong, Abdullah Mueen","doi":"10.1007/s10618-024-01005-2","DOIUrl":"https://doi.org/10.1007/s10618-024-01005-2","url":null,"abstract":"<p>Given a long time series, the distance profile of a query time series computes distances between the query and every possible subsequence of a long time series. MASS (Mueen’s Algorithm for Similarity Search) is an algorithm to efficiently compute distance profile under z-normalized Euclidean distance (Mueen et al. in The fastest similarity search algorithm for time series subsequences under Euclidean distance. http://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html, 2017). MASS is recognized as a useful tool in many data mining works. However, complete documentation of the increasingly efficient versions of the algorithm does not exist. In this paper, we formalize the notion of a distance profile, describe four versions of the MASS algorithm, show several extensions of distance profiles under various operating conditions, describe how MASS improves performances of existing data mining algorithms, and finally, show utility of MASS in domains including seismology, robotics and power grids.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"142 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139751823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms
Pub Date: 2024-01-31 | DOI: 10.1007/s10618-024-01002-5
Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho
Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possible hyperparameter configurations and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive performance. However, efficiently exploring this vast space of configurations and dealing with the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameter values already provide a suitable configuration. Additionally, for many reasons, including model validation and compliance with new legislation, there is increasing interest in interpretable models, such as those created by decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning on the two most often used DT induction algorithms, CART and C4.5. DT induction algorithms offer high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate the relevance of the hyperparameters using 94 classification datasets from OpenML. The experimental results show that tuning with algorithm-specific hyperparameter profiles provides statistically significant improvements on most of the datasets for CART, but on only one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for both algorithms was Irace. Finally, we found that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.
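A minimal scikit-learn sketch of randomized hyperparameter tuning for a CART-style decision tree on a single dataset; the search space, budget, and dataset are assumptions, and the paper's own study uses different tuning techniques (including Irace) over 94 OpenML datasets.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Compare a default decision tree with a tree tuned by random search over a
# few common CART hyperparameters (illustrative search space and budget).
X, y = load_breast_cancer(return_X_y=True)

default_score = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()

param_space = {
    "max_depth": randint(2, 30),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 50),
    "ccp_alpha": uniform(0.0, 0.02),
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_space, n_iter=50, cv=5, random_state=0,
)
search.fit(X, y)

print(f"default CV accuracy: {default_score:.3f}")
print(f"tuned   CV accuracy: {search.best_score_:.3f}")
print("best hyperparameters:", search.best_params_)
```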
{"title":"Better trees: an empirical study on hyperparameter tuning of classification decision tree induction algorithms","authors":"Rafael Gomes Mantovani, Tomáš Horváth, André L. D. Rossi, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André C. P. L. F. de Carvalho","doi":"10.1007/s10618-024-01002-5","DOIUrl":"https://doi.org/10.1007/s10618-024-01002-5","url":null,"abstract":"<p>Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations and their complex interactions, it is common to use optimization techniques to find settings that lead to high predictive performance. However, insights into efficiently exploring this vast space of configurations and dealing with the trade-off between predictive and runtime performance remain challenging. Furthermore, there are cases where the default hyperparameters fit the suitable configuration. Additionally, for many reasons, including model validation and attendance to new legislation, there is an increasing interest in interpretable models, such as those created by the decision tree (DT) induction algorithms. This paper provides a comprehensive approach for investigating the effects of hyperparameter tuning for the two DT induction algorithms most often used, CART and C4.5. DT induction algorithms present high predictive performance and interpretable classification models, though many hyperparameters need to be adjusted. Experiments were carried out with different tuning strategies to induce models and to evaluate hyperparameters’ relevance using 94 classification datasets from OpenML. The experimental results point out that different hyperparameter profiles for the tuning of each algorithm provide statistically significant improvements in most of the datasets for CART, but only in one-third for C4.5. Although different algorithms may present different tuning scenarios, the tuning techniques generally required few evaluations to find accurate solutions. Furthermore, the best technique for all the algorithms was the Irace. Finally, we found out that tuning a specific small subset of hyperparameters is a good alternative for achieving optimal predictive performance.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"13 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139658401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}