Regularization-based methods for ordinal quantification
Pub Date: 2024-08-15 | DOI: 10.1007/s10618-024-01067-2
Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz
Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of n > 2 classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other's developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.
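The smoothness assumption behind this regularization is easy to make concrete. Below is a minimal sketch, not the authors' published OQ algorithm: it post-processes a raw prevalence estimate from any quantifier by penalizing second differences across the ordered classes while keeping the result on the probability simplex (numpy/scipy, the function name, and the penalty weight lam are assumptions).

```python
import numpy as np
from scipy.optimize import minimize

def smooth_prevalences(p_hat, lam=1.0):
    """Project a raw prevalence estimate onto the probability simplex
    while penalizing curvature (second differences) across the ordered
    classes. Illustrative only; not the paper's regularized OQ method."""
    n = len(p_hat)
    D = np.diff(np.eye(n), n=2, axis=0)  # second-difference operator

    def objective(p):
        return np.sum((p - p_hat) ** 2) + lam * np.sum((D @ p) ** 2)

    res = minimize(objective, p_hat,
                   bounds=[(0.0, 1.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
    return res.x

# an ordinally implausible (zig-zag) estimate gets smoothed toward a plausible one
print(smooth_prevalences(np.array([0.05, 0.40, 0.02, 0.38, 0.15])))
```

A larger lam flattens the estimate more aggressively toward a smooth ordinal distribution.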
{"title":"Regularization-based methods for ordinal quantification","authors":"Mirko Bunse, Alejandro Moreo, Fabrizio Sebastiani, Martin Senz","doi":"10.1007/s10618-024-01067-2","DOIUrl":"https://doi.org/10.1007/s10618-024-01067-2","url":null,"abstract":"<p>Quantification, i.e., the task of predicting the class prevalence values in bags of unlabeled data items, has received increased attention in recent years. However, most quantification research has concentrated on developing algorithms for binary and multi-class problems in which the classes are not ordered. Here, we study the ordinal case, i.e., the case in which a total order is defined on the set of <span>(n>2)</span> classes. We give three main contributions to this field. First, we create and make available two datasets for ordinal quantification (OQ) research that overcome the inadequacies of the previously available ones. Second, we experimentally compare the most important OQ algorithms proposed in the literature so far. To this end, we bring together algorithms proposed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each others’ developments. Third, we propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments. The key to this gain in performance is that our regularization prevents ordinally implausible estimates, assuming that ordinal distributions tend to be smooth in practice. We informally verify this assumption for several real-world applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"75 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random walks with variable restarts for negative-example-informed label propagation
Pub Date: 2024-08-13 | DOI: 10.1007/s10618-024-01065-4
Sean Maxwell, Mehmet Koyutürk
Label propagation is frequently encountered in machine learning and data mining applications on graphs, either as a standalone problem or as part of node classification. Many label propagation algorithms utilize random walks (or network propagation), which provide limited ability to take into account negatively-labeled nodes (i.e., nodes that are known to be not associated with the label of interest). Specialized algorithms that incorporate negatively-labeled nodes generally focus on learning or readjusting the edge weights to drive walks away from negatively-labeled nodes and toward positively-labeled nodes. This approach has several disadvantages, as it increases the number of parameters to be learned and does not necessarily drive the walk away from regions of the network that are rich in negatively-labeled nodes. We reformulate random walk with restarts and network propagation to enable “variable restarts”, i.e., an increased likelihood of restarting at a positively-labeled node when a negatively-labeled node is encountered. Based on this reformulation, we develop CusTaRd, an algorithm that effectively combines variable restart probabilities and edge re-weighting to avoid negatively-labeled nodes. To assess the performance of CusTaRd, we perform comprehensive experiments on network datasets commonly used in benchmarking label propagation and node classification algorithms. Our results show that CusTaRd consistently outperforms competing algorithms that learn edge weights or restart profiles, and that negatives close to positive examples are generally more informative than more distant negatives.
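The variable-restart mechanism can be sketched as a power iteration with a per-node restart probability. The function below is a sketch of the idea only, not CusTaRd itself; alpha and boost are hypothetical parameters, not values from the paper.

```python
import numpy as np

def variable_restart_rwr(A, positives, negatives, alpha=0.15, boost=0.9,
                         max_iter=1000, tol=1e-9):
    """Random walk with per-node restart probabilities: the walk restarts
    (at a positively-labeled node) far more often when it sits on a
    negatively-labeled node."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions
    a = np.full(n, alpha)                  # baseline restart probability
    a[list(negatives)] = boost             # boosted restarts at negatives
    r = np.zeros(n)
    r[list(positives)] = 1.0 / len(positives)  # restarts land on positives
    p = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # with prob a[v] restart from r, otherwise follow an edge out of v
        p_next = ((1 - a) * p) @ P + (a * p).sum() * r
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return p  # stationary visit probabilities, usable as label scores
```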
{"title":"Random walks with variable restarts for negative-example-informed label propagation","authors":"Sean Maxwell, Mehmet Koyutürk","doi":"10.1007/s10618-024-01065-4","DOIUrl":"https://doi.org/10.1007/s10618-024-01065-4","url":null,"abstract":"<p>Label propagation is frequently encountered in machine learning and data mining applications on graphs, either as a standalone problem or as part of node classification. Many label propagation algorithms utilize random walks (or network propagation), which provide limited ability to take into account negatively-labeled nodes (i.e., nodes that are known to be not associated with the label of interest). Specialized algorithms to incorporate negatively-labeled nodes generally focus on learning or readjusting the edge weights to drive walks away from negatively-labeled nodes and toward positively-labeled nodes. This approach has several disadvantages, as it increases the number of parameters to be learned, and does not necessarily drive the walk away from regions of the network that are rich in negatively-labeled nodes. We reformulate random walk with restarts and network propagation to enable “variable restarts\", that is the increased likelihood of restarting at a positively-labeled node when a negatively-labeled node is encountered. Based on this reformulation, we develop <span>CusTaRd</span>, an algorithm that effectively combines variable restart probabilities and edge re-weighting to avoid negatively-labeled nodes. To assess the performance of <span>CusTaRd</span>, we perform comprehensive experiments on network datasets commonly used in benchmarking label propagation and node classification algorithms. Our results show that <span>CusTaRd</span> consistently outperforms competing algorithms that learn edge weights or restart profiles, and that negatives close to positive examples are generally more informative than more distant negatives.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"41 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142192895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discord-based counterfactual explanations for time series classification
Pub Date: 2024-08-07 | DOI: 10.1007/s10618-024-01028-9
Omar Bahri, Peiyu Li, Soukaina Filali Boubrahimi, Shah Muhammad Hamdi
The opacity inherent in machine learning models presents a significant hindrance to their widespread incorporation into decision-making processes. To address this challenge and foster trust among stakeholders while ensuring decision fairness, the data mining community has been actively advancing the explainable artificial intelligence paradigm. This paper contributes to the evolving field by focusing on counterfactual generation for time series classification models, a domain where research is relatively scarce. We develop a post-hoc, model-agnostic counterfactual explanation algorithm that leverages the Matrix Profile to map time series discords to their nearest neighbors in a target sequence, and we use this mapping to generate new counterfactual instances. To our knowledge, this is the first effort towards the use of time series discords for counterfactual explanations. We evaluate our algorithm on the University of California Riverside and University of East Anglia archives and compare it to three state-of-the-art univariate and multivariate methods.
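As a hedged illustration of the discord-to-neighbor mapping, the sketch below uses the stumpy library's Matrix Profile routines. The function, the window length m, and the splice step are assumptions for illustration, not the authors' published procedure.

```python
import numpy as np
import stumpy

def discord_counterfactual(x, target, m=32):
    """Hypothetical sketch: locate the most anomalous subsequence of x
    (its discord) via the Matrix Profile, then splice in that window's
    nearest neighbor from a series of the desired target class."""
    mp = stumpy.stump(x, m)                            # matrix profile of x
    discord = int(np.argmax(mp[:, 0].astype(float)))   # most anomalous window
    dp = stumpy.mass(x[discord:discord + m], target)   # distances into target
    nn = int(np.argmin(dp))                            # nearest-neighbor start
    cf = x.copy()
    cf[discord:discord + m] = target[nn:nn + m]        # splice the neighbor in
    return cf
```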
{"title":"Discord-based counterfactual explanations for time series classification","authors":"Omar Bahri, Peiyu Li, Soukaina Filali Boubrahimi, Shah Muhammad Hamdi","doi":"10.1007/s10618-024-01028-9","DOIUrl":"https://doi.org/10.1007/s10618-024-01028-9","url":null,"abstract":"<p>The opacity inherent in machine learning models presents a significant hindrance to their widespread incorporation into decision-making processes. To address this challenge and foster trust among stakeholders while ensuring decision fairness, the data mining community has been actively advancing the explainable artificial intelligence paradigm. This paper contributes to the evolving field by focusing on counterfactual generation for time series classification models, a domain where research is relatively scarce. We develop, a post-hoc, model agnostic counterfactual explanation algorithm that leverages the Matrix Profile to map time series discords to their nearest neighbors in a target sequence and use this mapping to generate new counterfactual instances. To our knowledge, this is the first effort towards the use of time series discords for counterfactual explanations. We evaluate our algorithm on the University of California Riverside and University of East Anglia archives and compare it to three state-of-the-art univariate and multivariate methods.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"26 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ArcMatch: high-performance subgraph matching for labeled graphs by exploiting edge domains
Pub Date: 2024-08-07 | DOI: 10.1007/s10618-024-01061-8
Vincenzo Bonnici, Roberto Grasso, Giovanni Micale, Antonio di Maria, Dennis Shasha, Alfredo Pulvirenti, Rosalba Giugno
Consider a large labeled graph (network), denoted the target. Subgraph matching is the problem of finding all instances of a small subgraph, denoted the query, in the target graph. Unlike the majority of existing methods, which are restricted to graphs with labels solely on vertices, our proposed approach, named ArcMatch, can effectively handle graphs with labels on both vertices and edges. ArcMatch introduces an efficient new vertex/edge domain data structure and filtering procedure to speed up subgraph queries. The procedure, called path-based reduction, filters initial domains by scanning them for paths up to a specified length that appear in the query graph. Additionally, ArcMatch incorporates existing techniques like variable ordering and parent selection, and adapts the core search process to take advantage of the information within edge domains. Experiments in real scenarios such as protein–protein interaction graphs, co-authorship networks, and email networks show that ArcMatch is faster than state-of-the-art systems when varying the number of distinct vertex labels over the whole target graph and the query sizes.
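A much-simplified sketch of the path-based reduction idea follows (vertex labels only, exhaustive path enumeration for clarity, so exponential in k; ArcMatch additionally filters edge domains and exploits edge labels). Nodes are assumed to carry a 'label' attribute.

```python
import networkx as nx

def label_paths(G, start, k):
    """All vertex-label sequences along simple paths with <= k edges
    starting at `start`."""
    seqs, stack = set(), [(start, (G.nodes[start]["label"],), {start})]
    while stack:
        v, seq, seen = stack.pop()
        seqs.add(seq)
        if len(seq) <= k:  # seq has len(seq) - 1 edges so far
            for w in G[v]:
                if w not in seen:
                    stack.append((w, seq + (G.nodes[w]["label"],), seen | {w}))
    return seqs

def path_filter_domains(query, target, k=2):
    """Keep target vertex v as a candidate for query vertex u only if every
    label path leaving u also occurs leaving v (superset test)."""
    return {u: {v for v in target
                if label_paths(target, v, k) >= label_paths(query, u, k)}
            for u in query}
```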
{"title":"ArcMatch: high-performance subgraph matching for labeled graphs by exploiting edge domains","authors":"Vincenzo Bonnici, Roberto Grasso, Giovanni Micale, Antonio di Maria, Dennis Shasha, Alfredo Pulvirenti, Rosalba Giugno","doi":"10.1007/s10618-024-01061-8","DOIUrl":"https://doi.org/10.1007/s10618-024-01061-8","url":null,"abstract":"<p>Consider a large labeled graph (network), denoted the <i>target</i>. Subgraph matching is the problem of finding all instances of a small subgraph, denoted the <i>query</i>, in the target graph. Unlike the majority of existing methods that are restricted to graphs with labels solely on vertices, our proposed approach, named can effectively handle graphs with labels on both vertices and edges. ntroduces an efficient new vertex/edge domain data structure filtering procedure to speed up subgraph queries. The procedure, called path-based reduction, filters initial domains by scanning them for paths up to a specified length that appear in the query graph. Additionally, ncorporates existing techniques like variable ordering and parent selection, as well as adapting the core search process, to take advantage of the information within edge domains. Experiments in real scenarios such as protein–protein interaction graphs, co-authorship networks, and email networks, show that s faster than state-of-the-art systems varying the number of distinct vertex labels over the whole target graph and query sizes.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"42 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On regime changes in text data using hidden Markov model of contaminated vMF distribution
Pub Date: 2024-08-03 | DOI: 10.1007/s10618-024-01051-w
Yingying Zhang, Shuchismita Sarkar, Yuanyuan Chen, Xuwen Zhu
This paper presents a novel methodology for analyzing temporal directional data with scatter and heavy tails. A hidden Markov model with a contaminated von Mises-Fisher emission distribution is developed. The model is implemented using a forward and backward selection approach that provides additional flexibility for contaminated as well as non-contaminated data. The utility of the method for finding homogeneous time blocks (regimes) is demonstrated on several experimental settings and on two real-life text data sets containing presidential addresses and corporate financial statements, respectively.
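The contaminated emission density itself is simple to write down. A sketch, with illustrative contamination weight alpha and concentration deflation rho (in the paper such quantities are estimated, per hidden state, rather than fixed):

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vmf_pdf(x, mu, kappa):
    """von Mises-Fisher density on the unit sphere in R^d."""
    d = len(mu)
    c = kappa ** (d / 2 - 1) / ((2 * np.pi) ** (d / 2) * iv(d / 2 - 1, kappa))
    return c * np.exp(kappa * mu @ x)

def contaminated_vmf_pdf(x, mu, kappa, alpha=0.1, rho=0.3):
    """Mixture of a 'good' vMF component and an alpha-weighted contamination
    component with deflated concentration rho * kappa, i.e. more scatter and
    heavier tails. alpha and rho here are illustrative values only."""
    return (1 - alpha) * vmf_pdf(x, mu, kappa) + alpha * vmf_pdf(x, mu, rho * kappa)
```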
{"title":"On regime changes in text data using hidden Markov model of contaminated vMF distribution","authors":"Yingying Zhang, Shuchismita Sarkar, Yuanyuan Chen, Xuwen Zhu","doi":"10.1007/s10618-024-01051-w","DOIUrl":"https://doi.org/10.1007/s10618-024-01051-w","url":null,"abstract":"<p>This paper presents a novel methodology for analyzing temporal directional data with scatter and heavy tails. A hidden Markov model with contaminated von Mises-Fisher emission distribution is developed. The model is implemented using forward and backward selection approach that provides additional flexibility for contaminated as well as non-contaminated data. The utility of the method for finding homogeneous time blocks (regimes) is demonstrated on several experimental settings and two real-life text data sets containing presidential addresses and corporate financial statements respectively.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"21 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sequential query prediction based on multi-armed bandits with ensemble of transformer experts and immediate feedback
Pub Date: 2024-08-02 | DOI: 10.1007/s10618-024-01057-4
Shameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith
We study the problem of predicting the next query to recommend in interactive data exploratory analysis, so as to guide users toward the correct content. Current query prediction approaches are based on sequence-to-sequence learning, exploiting past interaction data. However, due to the resource-hungry training process, such approaches fail to adapt to immediate user feedback. Immediate feedback is essential and is considered a signal of the user's intent. We contribute a novel query prediction ensemble mechanism that adapts to immediate feedback by relying on the multi-armed bandits framework. Our mechanism, an extension of the popular Exp3 algorithm, augments Transformer-based language models for query prediction by combining predictions from experts, thus dynamically building a candidate set during exploration. Immediate feedback is leveraged to choose the appropriate prediction in a probabilistic fashion. We provide a comprehensive large-scale experimental and comparative assessment using a popular online literature discovery service, which showcases that our mechanism (i) substantially improves the per-round regret over state-of-the-art Transformer-based models and (ii) shows the superiority of causal language modelling over masked language modelling for query recommendations.
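For reference, here is the textbook Exp3 update that the mechanism extends; the reward convention (1.0 when the user accepts the suggested query) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3:
    """Standard Exp3 over k experts (here, query predictors)."""
    def __init__(self, k, gamma=0.1):
        self.k, self.gamma, self.w = k, gamma, np.ones(k)

    def probs(self):
        return (1 - self.gamma) * self.w / self.w.sum() + self.gamma / self.k

    def pull(self):
        return int(rng.choice(self.k, p=self.probs()))

    def update(self, arm, reward):
        x_hat = reward / self.probs()[arm]  # importance-weighted reward
        self.w[arm] *= np.exp(self.gamma * x_hat / self.k)

bandit = Exp3(k=3)
arm = bandit.pull()             # expert whose predicted query is shown
bandit.update(arm, reward=1.0)  # e.g. 1.0 if the user accepts the suggestion
```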
{"title":"Sequential query prediction based on multi-armed bandits with ensemble of transformer experts and immediate feedback","authors":"Shameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith","doi":"10.1007/s10618-024-01057-4","DOIUrl":"https://doi.org/10.1007/s10618-024-01057-4","url":null,"abstract":"<p>We study the problem of predicting the next query to be recommended in interactive data exploratory analysis to guide users to correct content. Current query prediction approaches are based on sequence-to-sequence learning, exploiting past interaction data. However, due to the resource-hungry training process, such approaches fail to adapt to immediate user feedback. Immediate feedback is essential and considered as a signal of the user’s intent. We contribute with a novel query prediction ensemble mechanism, which adapts to immediate feedback relying on multi-armed bandits framework. Our mechanism, an extension to the popular Exp3 algorithm, augments Transformer-based language models for query predictions by combining predictions from experts, thus dynamically building a candidate set during exploration. Immediate feedback is leveraged to choose the appropriate prediction in a probabilistic fashion. We provide comprehensive large-scale experimental and comparative assessment using a popular online literature discovery service, which showcases that our mechanism (i) improves the per-round regret substantially against state-of-the-art Transformer-based models and (ii) shows the superiority of causal language modelling over masked language modelling for query recommendations.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainable and interpretable machine learning and data mining
Pub Date: 2024-07-30 | DOI: 10.1007/s10618-024-01041-y
Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr, Ute Schmid
The growing number of applications of machine learning and data mining in many domains—from agriculture to business, education, industrial manufacturing, and medicine—gave rise to new requirements for how to inspect and control the learned models. The research domain of explainable artificial intelligence (XAI) has been newly established with a strong focus on methods being applied post-hoc on black-box models. As an alternative, the use of interpretable machine learning methods has been considered—where the learned models are white-box ones. Black-box models can be characterized as representing implicit knowledge—typically resulting from statistical and neural approaches of machine learning, while white-box models are explicit representations of knowledge—typically resulting from rule-learning approaches. In this introduction to the special issue on ‘Explainable and Interpretable Machine Learning and Data Mining’ we propose to bring together both perspectives, pointing out commonalities and discussing possibilities to integrate them.
{"title":"Explainable and interpretable machine learning and data mining","authors":"Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr, Ute Schmid","doi":"10.1007/s10618-024-01041-y","DOIUrl":"https://doi.org/10.1007/s10618-024-01041-y","url":null,"abstract":"<p>The growing number of applications of machine learning and data mining in many domains—from agriculture to business, education, industrial manufacturing, and medicine—gave rise to new requirements for how to inspect and control the learned models. The research domain of explainable artificial intelligence (XAI) has been newly established with a strong focus on methods being applied post-hoc on black-box models. As an alternative, the use of interpretable machine learning methods has been considered—where the learned models are white-box ones. Black-box models can be characterized as representing implicit knowledge—typically resulting from statistical and neural approaches of machine learning, while white-box models are explicit representations of knowledge—typically resulting from rule-learning approaches. In this introduction to the special issue on ‘Explainable and Interpretable Machine Learning and Data Mining’ we propose to bring together both perspectives, pointing out commonalities and discussing possibilities to integrate them.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141863526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating outlier probabilities: assessing sharpness, refinement, and calibration using stratified and weighted measures
Pub Date: 2024-07-19 | DOI: 10.1007/s10618-024-01056-5
Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello, Arthur Zimek
An outlier probability is the probability that an observation is an outlier. Typically, outlier detection algorithms calculate real-valued outlier scores to identify outliers. Converting outlier scores into outlier probabilities increases the interpretability of outlier scores for domain experts and makes outlier scores from different outlier detection algorithms comparable. Although several transformations to convert outlier scores to outlier probabilities have been proposed in the literature, there is no common understanding of good outlier probabilities and no standard approach to evaluate outlier probabilities. We require that good outlier probabilities be sharp, refined, and calibrated. To evaluate these properties, we adapt and propose novel measures that use ground-truth labels indicating which observation is an outlier or an inlier. The refinement and calibration measures partition the outlier probabilities into bins or use kernel smoothing. Compared to the evaluation of probability in supervised learning, several aspects are relevant when evaluating outlier probabilities, mainly due to the imbalanced and often unsupervised nature of outlier detection. First, stratified and weighted measures are necessary to evaluate the probabilities of outliers well. Second, the joint use of the sharpness, refinement, and calibration errors makes it possible to independently measure the corresponding characteristics of outlier probabilities. Third, equiareal bins, where the product of observations per bin times bin length is constant, balance the number of observations per bin and bin length, allowing accurate evaluation of different outlier probability ranges. Finally, we show that good outlier probabilities, according to the proposed measures, improve the performance of the follow-up task of converting outlier probabilities into labels for outliers and inliers.
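A greedy sketch of the equiareal-binning idea and a bin-based calibration error follows; this is illustrative only, and the paper's exact construction and measures may differ.

```python
import numpy as np

def equiareal_bins(probs, n_bins=10):
    """Cut a new bin once (observations in bin) * (bin length) exceeds a
    fixed area budget, balancing equal-width against equal-frequency bins."""
    p = np.sort(np.asarray(probs))
    budget = len(p) / n_bins ** 2  # area of an equal-count, equal-width bin
    edges, start, count = [0.0], 0.0, 0
    for x in p:
        count += 1
        if count * (x - start) >= budget and x < 1.0:
            edges.append(x)
            start, count = x, 0
    edges.append(1.0)
    return np.array(edges)

def binned_calibration_error(probs, labels, edges):
    """Mean |observed outlier fraction - mean predicted probability| per bin."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    errs = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) | (hi == edges[-1]))
        if mask.any():
            errs.append(abs(labels[mask].mean() - probs[mask].mean()))
    return float(np.mean(errs)) if errs else 0.0
```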
{"title":"Evaluating outlier probabilities: assessing sharpness, refinement, and calibration using stratified and weighted measures","authors":"Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello, Arthur Zimek","doi":"10.1007/s10618-024-01056-5","DOIUrl":"https://doi.org/10.1007/s10618-024-01056-5","url":null,"abstract":"<p>An outlier probability is the probability that an observation is an outlier. Typically, outlier detection algorithms calculate real-valued outlier scores to identify outliers. Converting outlier scores into outlier probabilities increases the interpretability of outlier scores for domain experts and makes outlier scores from different outlier detection algorithms comparable. Although several transformations to convert outlier scores to outlier probabilities have been proposed in the literature, there is no common understanding of good outlier probabilities and no standard approach to evaluate outlier probabilities. We require that good outlier probabilities be sharp, refined, and calibrated. To evaluate these properties, we adapt and propose novel measures that use ground-truth labels indicating which observation is an outlier or an inlier. The refinement and calibration measures partition the outlier probabilities into bins or use kernel smoothing. Compared to the evaluation of probability in supervised learning, several aspects are relevant when evaluating outlier probabilities, mainly due to the imbalanced and often unsupervised nature of outlier detection. First, stratified and weighted measures are necessary to evaluate the probabilities of outliers well. Second, the joint use of the sharpness, refinement, and calibration errors makes it possible to independently measure the corresponding characteristics of outlier probabilities. Third, equiareal bins, where the product of observations per bin times bin length is constant, balance the number of observations per bin and bin length, allowing accurate evaluation of different outlier probability ranges. Finally, we show that good outlier probabilities, according to the proposed measures, improve the performance of the follow-up task of converting outlier probabilities into labels for outliers and inliers.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"30 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141739956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient-based explanation for non-linear non-parametric dimensionality reduction
Pub Date: 2024-07-11 | DOI: 10.1007/s10618-024-01055-6
Sacha Corbugy, Rebecca Marion, Benoît Frénay
Dimensionality reduction (DR) is a popular technique that shows great results for analyzing high-dimensional data. Generally, DR is used to produce visualizations in 2 or 3 dimensions. While it can help in understanding correlations between data, the embeddings generated by DR are hard to grasp. The position of instances in low dimension may be difficult to interpret, especially for non-linear, non-parametric DR techniques. Because most of the techniques are said to be neighborhood-preserving (which means that explaining long distances is not relevant), some approaches try to explain them locally. These methods use simpler interpretable models to approximate the decision frontier locally, which can lead to misleading explanations. In this paper, a novel approach to locally explain non-linear, non-parametric DR embeddings like t-SNE is introduced. It is the first gradient-based method for explaining these DR algorithms. The technique presented in this paper is applied to t-SNE, but is theoretically suitable for any DR method formulated as a minimization or maximization problem. The approach uses the analytical derivative of a t-SNE embedding to explain the position of an instance in the visualization.
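A finite-difference proxy conveys what such an explanation computes. Note the hedges: the paper derives the t-SNE gradient analytically, and the deterministic embed function assumed here is a simplification (t-SNE needs a fixed initialization and seed to behave deterministically).

```python
import numpy as np

def embedding_gradient(embed, X, i, eps=1e-4):
    """Finite-difference stand-in for the analytical derivative: how the
    2-D position of instance i moves when each input feature of i is
    perturbed. `embed` must be a deterministic map from an (n, d) array
    to (n, 2) coordinates for this to make sense."""
    _, d = X.shape
    base = embed(X)[i]
    grad = np.zeros((2, d))
    for j in range(d):
        Xp = X.copy()
        Xp[i, j] += eps
        grad[:, j] = (embed(Xp)[i] - base) / eps
    return grad  # grad[:, j] = dy/dx_j; large columns flag influential features
```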
{"title":"Gradient-based explanation for non-linear non-parametric dimensionality reduction","authors":"Sacha Corbugy, Rebecca Marion, Benoît Frénay","doi":"10.1007/s10618-024-01055-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01055-6","url":null,"abstract":"<p>Dimensionality reduction (DR) is a popular technique that shows great results to analyze high-dimensional data. Generally, DR is used to produce visualizations in 2 or 3 dimensions. While it can help understanding correlations between data, embeddings generated by DR are hard to grasp. The position of instances in low-dimension may be difficult to interpret, especially for non-linear, non-parametric DR techniques. Because most of the techniques are said to be neighborhood preserving (which means that explaining long distances is not relevant), some approaches try explaining them locally. These methods use simpler interpretable models to approximate the decision frontier locally. This can lead to misleading explanations. In this paper a novel approach to locally explain non-linear, non-parametric DR embeddings like t-SNE is introduced. It is the first gradient-based method for explaining these DR algorithms. The technique presented in this paper is applied on t-SNE, but is theoretically suitable for any DR method that is a minimization or maximization problem. The approach uses the analytical derivative of a t-SNE embedding to explain the position of an instance in the visualization.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"112 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141585680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
De-confounding representation learning for counterfactual inference on continuous treatment via generative adversarial network
Pub Date: 2024-07-11 | DOI: 10.1007/s10618-024-01058-3
Yonghe Zhao, Qiang Huang, Haolong Zeng, Yun Peng, Huiyan Sun
Counterfactual inference for continuous rather than binary treatment variables is more common in real-world causal inference tasks. While there are already some sample reweighting methods based on the Marginal Structural Model for eliminating confounding bias, they generally focus on removing the treatment's linear dependence on confounders and rely on the accuracy of the assumed parametric models, which are usually unverifiable. In this paper, we propose a de-confounding representation learning (DRL) framework for counterfactual outcome estimation under continuous treatment, which generates representations of covariates decorrelated from the treatment variables. DRL is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates. Specifically, we train the correlations between the de-confounding representations and the treatment variables against the correlations between the covariate representations and the treatment variables to eliminate confounding bias. Further, a counterfactual inference network is embedded into the framework so that the learned representations serve both de-confounding and trusted inference. Extensive experiments on synthetic and semi-synthetic datasets show that the DRL model excels at learning de-confounding representations and outperforms state-of-the-art counterfactual inference models for continuous treatment variables. In addition, we apply the DRL model to the real-world medical dataset MIMIC III and demonstrate a detailed causal relationship between red cell width distribution and mortality.
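The decorrelation objective can be sketched with a simple correlation penalty in PyTorch. This is a stand-in for the paper's adversarial GAN training; the architecture, toy data, and penalty weight are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy decorrelation sketch: learn a representation Z of the covariates whose
# Pearson correlation with the continuous treatment t is pushed toward zero
# while Z (plus t) stays predictive of the outcome y. A penalty replaces the
# paper's adversarial training; all sizes and data here are toy assumptions.
enc = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 8))
head = nn.Linear(8 + 1, 1)  # outcome head sees Z and the treatment

opt = torch.optim.Adam([*enc.parameters(), *head.parameters()], lr=1e-3)

def pearson(a, b):
    """Per-dimension Pearson correlation between columns of a and vector b."""
    a, b = a - a.mean(0), b - b.mean()
    return (a * b[:, None]).mean(0) / (a.std(0) * b.std() + 1e-8)

x, t, y = torch.randn(256, 10), torch.randn(256), torch.randn(256)
for _ in range(200):
    z = enc(x)
    y_hat = head(torch.cat([z, t[:, None]], dim=1)).squeeze(-1)
    loss = nn.functional.mse_loss(y_hat, y) + 1.0 * pearson(z, t).pow(2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```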
{"title":"De-confounding representation learning for counterfactual inference on continuous treatment via generative adversarial network","authors":"Yonghe Zhao, Qiang Huang, Haolong Zeng, Yun Peng, Huiyan Sun","doi":"10.1007/s10618-024-01058-3","DOIUrl":"https://doi.org/10.1007/s10618-024-01058-3","url":null,"abstract":"<p>Counterfactual inference for continuous rather than binary treatment variables is more common in real-world causal inference tasks. While there are already some sample reweighting methods based on Marginal Structural Model for eliminating the confounding bias, they generally focus on removing the treatment’s linear dependence on confounders and rely on the accuracy of the assumed parametric models, which are usually unverifiable. In this paper, we propose a de-confounding representation learning (DRL) framework for counterfactual outcome estimation of continuous treatment by generating the representations of covariates decorrelated with the treatment variables. The DRL is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates. Specifically, we train the correlations between the de-confounding representations and the treatment variables against the correlations between the covariate representations and the treatment variables to eliminate confounding bias. Further, a counterfactual inference network is embedded into the framework to make the learned representations serve both de-confounding and trusted inference. Extensive experiments on synthetic and semi-synthetic datasets show that the DRL model performs superiorly in learning de-confounding representations and outperforms state-of-the-art counterfactual inference models for continuous treatment variables. In addition, we apply the DRL model to a real-world medical dataset MIMIC III and demonstrate a detailed causal relationship between red cell width distribution and mortality.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"35 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141585681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}