
Latest articles from Data Mining and Knowledge Discovery

Random walks with variable restarts for negative-example-informed label propagation
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-13. DOI: 10.1007/s10618-024-01065-4
Sean Maxwell, Mehmet Koyutürk

Label propagation is frequently encountered in machine learning and data mining applications on graphs, either as a standalone problem or as part of node classification. Many label propagation algorithms utilize random walks (or network propagation), which provide limited ability to take into account negatively-labeled nodes (i.e., nodes that are known to be not associated with the label of interest). Specialized algorithms that incorporate negatively-labeled nodes generally focus on learning or readjusting the edge weights to drive walks away from negatively-labeled nodes and toward positively-labeled nodes. This approach has several disadvantages: it increases the number of parameters to be learned, and it does not necessarily drive the walk away from regions of the network that are rich in negatively-labeled nodes. We reformulate random walk with restarts and network propagation to enable "variable restarts", that is, an increased likelihood of restarting at a positively-labeled node when a negatively-labeled node is encountered. Based on this reformulation, we develop CusTaRd, an algorithm that effectively combines variable restart probabilities and edge re-weighting to avoid negatively-labeled nodes. To assess the performance of CusTaRd, we perform comprehensive experiments on network datasets commonly used in benchmarking label propagation and node classification algorithms. Our results show that CusTaRd consistently outperforms competing algorithms that learn edge weights or restart profiles, and that negatives close to positive examples are generally more informative than more distant negatives.
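The variable-restart idea admits a compact sketch as a power iteration in which each node carries its own restart probability, boosted at negatively-labeled nodes so that a walk entering a negative region is likely to teleport back to a positive seed. The function below is an illustrative reading of that reformulation, not the authors' CusTaRd implementation; all names and parameter values are invented here.

```python
import numpy as np

def rwr_variable_restart(W, positives, negatives,
                         alpha=0.15, alpha_neg=0.9,
                         tol=1e-10, max_iter=1000):
    """Random walk with restarts in which the restart probability is
    boosted at negatively-labeled nodes (alpha_neg >> alpha), so that
    walks entering negative regions teleport back to positive seeds."""
    n = W.shape[0]
    # Column-stochastic transition matrix.
    P = W / W.sum(axis=0, keepdims=True)
    # Restart distribution: uniform over positively-labeled nodes.
    r = np.zeros(n)
    r[positives] = 1.0 / len(positives)
    # Node-specific restart probabilities.
    a = np.full(n, alpha)
    a[negatives] = alpha_neg
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Continue the walk from node j with prob (1 - a[j]);
        # restart at a positive seed with prob a[j].
        x_new = P @ ((1.0 - a) * x) + r * np.sum(a * x)
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x
```

On a small path graph with the positive seed at one end and a negative node at the other, the resulting scores concentrate around the positive seed.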

Citations: 0
Discord-based counterfactual explanations for time series classification
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-07. DOI: 10.1007/s10618-024-01028-9
Omar Bahri, Peiyu Li, Soukaina Filali Boubrahimi, Shah Muhammad Hamdi

The opacity inherent in machine learning models presents a significant hindrance to their widespread incorporation into decision-making processes. To address this challenge and foster trust among stakeholders while ensuring decision fairness, the data mining community has been actively advancing the explainable artificial intelligence paradigm. This paper contributes to the evolving field by focusing on counterfactual generation for time series classification models, a domain where research is relatively scarce. We develop a post-hoc, model-agnostic counterfactual explanation algorithm that leverages the Matrix Profile to map time series discords to their nearest neighbors in a target sequence and uses this mapping to generate new counterfactual instances. To our knowledge, this is the first effort toward the use of time series discords for counterfactual explanations. We evaluate our algorithm on the University of California Riverside and University of East Anglia archives and compare it to three state-of-the-art univariate and multivariate methods.
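As a rough illustration of the ingredients, the sketch below computes a simple matrix-profile discord (the subsequence farthest from its nearest non-overlapping neighbor) and swaps it with the closest subsequence drawn from a series of the target class. This brute-force version stands in for the real Matrix Profile machinery and is not the authors' algorithm; all names are hypothetical.

```python
import numpy as np

def sliding_windows(ts, m):
    """All length-m subsequences of a 1-D series, one per row."""
    return np.lib.stride_tricks.sliding_window_view(ts, m)

def discord_index(ts, m):
    """Index of the discord: the subsequence with the largest distance
    to its nearest non-trivial neighbor (brute-force matrix profile)."""
    subs = sliding_windows(ts, m)
    n = len(subs)
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        # Exclude trivially overlapping matches around i.
        lo, hi = max(0, i - m // 2), min(n, i + m // 2 + 1)
        d[lo:hi] = np.inf
        profile[i] = d.min()
    return int(profile.argmax())

def discord_counterfactual(ts, target_ts, m):
    """Replace the discord of ts with the closest length-m subsequence
    drawn from a series of the desired target class."""
    i = discord_index(ts, m)
    cand = sliding_windows(target_ts, m)
    j = int(np.linalg.norm(cand - ts[i:i + m], axis=1).argmin())
    cf = ts.copy()
    cf[i:i + m] = cand[j]
    return cf
```

On a flat series with an injected spike, the discord lands on the spike, and substituting the nearest target-class subsequence removes most of it.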

Citations: 0
ArcMatch: high-performance subgraph matching for labeled graphs by exploiting edge domains
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-07. DOI: 10.1007/s10618-024-01061-8
Vincenzo Bonnici, Roberto Grasso, Giovanni Micale, Antonio di Maria, Dennis Shasha, Alfredo Pulvirenti, Rosalba Giugno

Consider a large labeled graph (network), denoted the target. Subgraph matching is the problem of finding all instances of a small subgraph, denoted the query, in the target graph. Unlike the majority of existing methods, which are restricted to graphs with labels solely on vertices, our proposed approach, named ArcMatch, can effectively handle graphs with labels on both vertices and edges. ArcMatch introduces an efficient new vertex/edge domain data structure filtering procedure to speed up subgraph queries. The procedure, called path-based reduction, filters initial domains by scanning them for paths up to a specified length that appear in the query graph. Additionally, ArcMatch incorporates existing techniques like variable ordering and parent selection, as well as adapting the core search process, to take advantage of the information within edge domains. Experiments in real scenarios such as protein–protein interaction graphs, co-authorship networks, and email networks show that ArcMatch is faster than state-of-the-art systems when varying the number of distinct vertex labels over the whole target graph and query sizes.
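The flavor of domain filtering over labeled vertices and edges can be sketched with a one-step arc-consistency pass: a candidate target vertex survives in a query vertex's domain only if every labeled query edge at that query vertex can be matched by a labeled target edge into the neighbor's domain. ArcMatch's path-based reduction generalizes this to paths up to a specified length; the code below is a simplified stand-in with invented names.

```python
from collections import defaultdict

def filter_domains(q_vlab, q_edges, t_vlab, t_edges):
    """One-step arc-consistency filtering of candidate domains for
    labeled subgraph matching. Edges are (u, v, edge_label) triples;
    vertex labels are dicts vertex -> label."""
    # Undirected adjacency with edge labels.
    t_adj = defaultdict(set)
    for a, b, el in t_edges:
        t_adj[a].add((b, el))
        t_adj[b].add((a, el))
    q_adj = defaultdict(set)
    for a, b, el in q_edges:
        q_adj[a].add((b, el))
        q_adj[b].add((a, el))
    # Initial domains: target vertices with a matching vertex label.
    dom = {u: {v for v, l in t_vlab.items() if l == ul}
           for u, ul in q_vlab.items()}
    changed = True
    while changed:
        changed = False
        for u, nbrs in q_adj.items():
            for v in list(dom[u]):
                # v must offer, for every labeled query edge at u, a
                # matching labeled edge into the neighbor's domain.
                ok = all(any(w in dom[u2] and el2 == el
                             for w, el2 in t_adj[v])
                         for u2, el in nbrs)
                if not ok:
                    dom[u].discard(v)
                    changed = True
    return dom
```

A vertex with the right label but no incident edge of the required edge label is pruned, which is exactly the extra power edge labels buy over vertex-only filtering.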

Citations: 0
On regime changes in text data using hidden Markov model of contaminated vMF distribution
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-03. DOI: 10.1007/s10618-024-01051-w
Yingying Zhang, Shuchismita Sarkar, Yuanyuan Chen, Xuwen Zhu

This paper presents a novel methodology for analyzing temporal directional data with scatter and heavy tails. A hidden Markov model with a contaminated von Mises-Fisher emission distribution is developed. The model is implemented using a forward and backward selection approach that provides additional flexibility for contaminated as well as non-contaminated data. The utility of the method for finding homogeneous time blocks (regimes) is demonstrated on several experimental settings and two real-life text data sets containing presidential addresses and corporate financial statements, respectively.
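A contaminated vMF emission can be read as a two-component mixture sharing one mean direction, with the contaminating component given a deflated concentration so it is flatter and absorbs scatter and heavy tails. The sketch below follows the contaminated-normal convention; the exact parameterization in the paper may differ, and all names here are assumptions.

```python
import numpy as np
from scipy.special import ive

def vmf_logpdf(x, mu, kappa):
    """Log-density of the von Mises-Fisher distribution on the
    (p-1)-sphere, for a unit vector x and mean direction mu."""
    p = mu.shape[0]
    # log C_p(kappa), using the exponentially scaled Bessel function
    # ive for stability: iv(v, k) = ive(v, k) * exp(k).
    log_c = ((p / 2 - 1) * np.log(kappa)
             - (p / 2) * np.log(2 * np.pi)
             - (np.log(ive(p / 2 - 1, kappa)) + kappa))
    return log_c + kappa * (x @ mu)

def contaminated_vmf_pdf(x, mu, kappa, alpha, eta):
    """Mixture of a 'good' vMF and a flatter contaminating vMF with
    the same mean direction and concentration eta * kappa (0 < eta < 1).
    alpha is the proportion of non-contaminated observations."""
    good = np.exp(vmf_logpdf(x, mu, kappa))
    bad = np.exp(vmf_logpdf(x, mu, eta * kappa))
    return alpha * good + (1 - alpha) * bad
```

The density is largest along the shared mean direction and decays toward the antipode, with the contaminating component thickening the tails.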

Citations: 0
Sequential query prediction based on multi-armed bandits with ensemble of transformer experts and immediate feedback
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-08-02. DOI: 10.1007/s10618-024-01057-4
Shameem A. Puthiya Parambath, Christos Anagnostopoulos, Roderick Murray-Smith

We study the problem of predicting the next query to recommend in interactive data exploratory analysis to guide users to correct content. Current query prediction approaches are based on sequence-to-sequence learning, exploiting past interaction data. However, due to the resource-hungry training process, such approaches fail to adapt to immediate user feedback. Immediate feedback is essential and is considered a signal of the user's intent. We contribute a novel query prediction ensemble mechanism that adapts to immediate feedback by relying on the multi-armed bandits framework. Our mechanism, an extension of the popular Exp3 algorithm, augments Transformer-based language models for query prediction by combining predictions from experts, thus dynamically building a candidate set during exploration. Immediate feedback is leveraged to choose the appropriate prediction in a probabilistic fashion.
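The Exp3 backbone that the mechanism extends can be sketched in a few lines: keep one weight per expert, mix the normalized weights with uniform exploration, play an expert, and apply an importance-weighted update from the immediate feedback. This is the textbook algorithm, not the authors' extension; names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

class Exp3:
    """Exp3 adversarial bandit over k experts: a minimal sketch of the
    kind of ensemble used to pick a prediction expert and update it
    from immediate feedback."""
    def __init__(self, k, gamma=0.1):
        self.k, self.gamma = k, gamma
        self.weights = np.ones(k)

    def probs(self):
        w = self.weights / self.weights.sum()
        return (1 - self.gamma) * w + self.gamma / self.k

    def draw(self):
        return int(rng.choice(self.k, p=self.probs()))

    def update(self, arm, reward):
        # Importance-weighted reward estimate keeps the update
        # unbiased even though only one expert was played.
        p = self.probs()[arm]
        self.weights[arm] *= np.exp(self.gamma * reward / (p * self.k))
```

Simulating two experts where one is rewarded far more often, the sampling probability shifts toward the better expert.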

Citations: 0
Explainable and interpretable machine learning and data mining
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-30. DOI: 10.1007/s10618-024-01041-y
Martin Atzmueller, Johannes Fürnkranz, Tomáš Kliegr, Ute Schmid

The growing number of applications of machine learning and data mining in many domains—from agriculture to business, education, industrial manufacturing, and medicine—gave rise to new requirements for how to inspect and control the learned models. The research domain of explainable artificial intelligence (XAI) has been newly established with a strong focus on methods being applied post-hoc on black-box models. As an alternative, the use of interpretable machine learning methods has been considered—where the learned models are white-box ones. Black-box models can be characterized as representing implicit knowledge—typically resulting from statistical and neural approaches of machine learning, while white-box models are explicit representations of knowledge—typically resulting from rule-learning approaches. In this introduction to the special issue on ‘Explainable and Interpretable Machine Learning and Data Mining’ we propose to bring together both perspectives, pointing out commonalities and discussing possibilities to integrate them.

Citations: 0
Evaluating outlier probabilities: assessing sharpness, refinement, and calibration using stratified and weighted measures
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-19. DOI: 10.1007/s10618-024-01056-5
Philipp Röchner, Henrique O. Marques, Ricardo J. G. B. Campello, Arthur Zimek

An outlier probability is the probability that an observation is an outlier. Typically, outlier detection algorithms calculate real-valued outlier scores to identify outliers. Converting outlier scores into outlier probabilities increases the interpretability of outlier scores for domain experts and makes outlier scores from different outlier detection algorithms comparable. Although several transformations to convert outlier scores to outlier probabilities have been proposed in the literature, there is no common understanding of good outlier probabilities and no standard approach to evaluate outlier probabilities. We require that good outlier probabilities be sharp, refined, and calibrated. To evaluate these properties, we adapt and propose novel measures that use ground-truth labels indicating which observation is an outlier or an inlier. The refinement and calibration measures partition the outlier probabilities into bins or use kernel smoothing. Compared to the evaluation of probability in supervised learning, several aspects are relevant when evaluating outlier probabilities, mainly due to the imbalanced and often unsupervised nature of outlier detection. First, stratified and weighted measures are necessary to evaluate the probabilities of outliers well. Second, the joint use of the sharpness, refinement, and calibration errors makes it possible to independently measure the corresponding characteristics of outlier probabilities. Third, equiareal bins, where the product of observations per bin times bin length is constant, balance the number of observations per bin and bin length, allowing accurate evaluation of different outlier probability ranges. Finally, we show that good outlier probabilities, according to the proposed measures, improve the performance of the follow-up task of converting outlier probabilities into labels for outliers and inliers.
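One of the weighted measures can be sketched as a binned calibration error in which each observation is weighted inversely to its class frequency, so the rare outlier class is not drowned out by the inliers. This is only an illustrative reading of the "weighted" idea; the authors' stratified, kernel-smoothed, and equiareal-bin variants are not reproduced here, and the function name is invented.

```python
import numpy as np

def weighted_calibration_error(probs, labels, n_bins=10):
    """Binned expected calibration error for outlier probabilities,
    with each observation weighted inversely to its class frequency
    (labels: 1 = outlier, 0 = inlier)."""
    probs = np.asarray(probs, float)
    labels = np.asarray(labels, float)
    w = np.where(labels == 1,
                 1.0 / labels.mean(),
                 1.0 / (1.0 - labels.mean()))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = idx == b
        if m.any():
            wb = w[m]
            conf = np.average(probs[m], weights=wb)   # mean predicted prob
            freq = np.average(labels[m], weights=wb)  # weighted outlier rate
            err += wb.sum() * abs(conf - freq)
    return err / w.sum()
```

A predictor that outputs the true labels as probabilities scores zero, while an anti-calibrated one scores the maximal error of one, regardless of how imbalanced the classes are.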

Citations: 0
Gradient-based explanation for non-linear non-parametric dimensionality reduction
IF 4.8, CAS Tier 3, Computer Science, Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE. Pub Date: 2024-07-11. DOI: 10.1007/s10618-024-01055-6
Sacha Corbugy, Rebecca Marion, Benoît Frénay

Dimensionality reduction (DR) is a popular technique that shows great results for analyzing high-dimensional data. Generally, DR is used to produce visualizations in 2 or 3 dimensions. While it can help in understanding correlations between data, embeddings generated by DR are hard to grasp. The position of instances in the low-dimensional space may be difficult to interpret, especially for non-linear, non-parametric DR techniques. Because most of the techniques are said to be neighborhood-preserving (which means that explaining long distances is not relevant), some approaches try explaining them locally. These methods use simpler interpretable models to approximate the decision frontier locally, which can lead to misleading explanations. In this paper, a novel approach to locally explain non-linear, non-parametric DR embeddings like t-SNE is introduced. It is the first gradient-based method for explaining these DR algorithms. The technique presented in this paper is applied to t-SNE but is theoretically suitable for any DR method formulated as a minimization or maximization problem. The approach uses the analytical derivative of a t-SNE embedding to explain the position of an instance in the visualization.
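For context, t-SNE's objective has a well-known closed-form gradient with respect to the embedding coordinates, which is the kind of analytical derivative such a method builds on (the paper goes further and differentiates the embedding position with respect to the inputs). The snippet below implements the standard formula and can be checked against finite differences; it is background, not the paper's explanation method.

```python
import numpy as np

def tsne_kl(Y, P):
    """t-SNE objective KL(P || Q) for an embedding Y and symmetric,
    normalized, zero-diagonal affinities P."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def tsne_grad(Y, P):
    """Analytical gradient of the t-SNE objective:
    dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    PQW = (P - Q) * W
    return 4.0 * (np.diag(PQW.sum(axis=1)) - PQW) @ Y
```

The matrix form follows from splitting the sum over j into the y_i and y_j terms; agreement with a central finite difference confirms the formula.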

Citations: 0
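The paper's full derivation is not reproduced in this listing; as an illustrative sketch (not the authors' implementation), the analytical gradient of the standard t-SNE objective with respect to the embedding coordinates — the quantity any gradient-based explanation of t-SNE builds on — can be computed in closed form and checked against finite differences:

```python
import numpy as np

def tsne_kl_and_grad(Y, P):
    """KL cost C = sum_ij p_ij log(p_ij / q_ij) of a t-SNE embedding Y
    given symmetric affinities P, plus its analytical gradient
    dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) / (1 + ||y_i - y_j||^2)."""
    # pairwise squared distances in the low-dimensional embedding
    diff = Y[:, None, :] - Y[None, :, :]
    D = np.sum(diff ** 2, axis=-1)
    W = 1.0 / (1.0 + D)                 # Student-t kernel weights
    np.fill_diagonal(W, 0.0)
    Q = np.maximum(W / W.sum(), 1e-12)  # low-dimensional affinities
    cost = np.sum(P * np.log(np.maximum(P, 1e-12) / Q))
    # analytical t-SNE gradient (van der Maaten & Hinton, 2008)
    PQ = (P - Q) * W
    grad = 4.0 * np.einsum('ij,ijk->ik', PQ, diff)
    return cost, grad
```

The closed-form gradient agrees with a central finite-difference estimate of the KL cost, which is the basic sanity check before using such derivatives to attribute an instance's position to its inputs.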
De-confounding representation learning for counterfactual inference on continuous treatment via generative adversarial network
IF 4.8 CAS Tier 3, Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2024-07-11 DOI: 10.1007/s10618-024-01058-3
Yonghe Zhao, Qiang Huang, Haolong Zeng, Yun Peng, Huiyan Sun

Counterfactual inference for continuous rather than binary treatment variables is more common in real-world causal inference tasks. While there are already some sample reweighting methods based on the Marginal Structural Model for eliminating confounding bias, they generally focus on removing the treatment’s linear dependence on confounders and rely on the accuracy of the assumed parametric models, which are usually unverifiable. In this paper, we propose a de-confounding representation learning (DRL) framework for counterfactual outcome estimation of continuous treatment, which generates representations of covariates decorrelated from the treatment variables. DRL is a non-parametric model that eliminates both linear and nonlinear dependence between treatment and covariates. Specifically, we adversarially train the correlations between the de-confounding representations and the treatment variables against the correlations between the covariate representations and the treatment variables to eliminate confounding bias. Further, a counterfactual inference network is embedded into the framework so that the learned representations serve both de-confounding and trusted inference. Extensive experiments on synthetic and semi-synthetic datasets show that the DRL model excels at learning de-confounding representations and outperforms state-of-the-art counterfactual inference models for continuous treatment variables. In addition, we apply the DRL model to a real-world medical dataset, MIMIC-III, and demonstrate a detailed causal relationship between red cell distribution width and mortality.

Citations: 0
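The DRL model itself is adversarial and non-parametric, and no code accompanies this listing. As a much simpler linear analogue of its goal — producing covariate representations decorrelated from a continuous treatment — one can regress each covariate on the treatment and keep the residuals (the function name and data below are illustrative, not from the paper):

```python
import numpy as np

def residualize(X, t):
    """Remove each covariate's *linear* dependence on a continuous
    treatment t by regressing it on t (with intercept) and keeping
    the residual. The residuals are uncorrelated with t by construction."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    T = np.hstack([np.ones_like(t), t])           # design matrix [1, t]
    beta, *_ = np.linalg.lstsq(T, X, rcond=None)  # per-column OLS fit
    return X - T @ beta                           # residualized covariates
```

Unlike this linear sketch, the paper's adversarial training targets nonlinear dependence as well, but the residuals here already satisfy the zero-correlation property the abstract describes.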
Enhancing racism classification: an automatic multilingual data annotation system using self-training and CNN
IF 4.8 CAS Tier 3, Computer Science Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2024-07-11 DOI: 10.1007/s10618-024-01059-2
Ikram El Miqdadi, Soufiane Hourri, Fatima Zahra El Idrysy, Assia Hayati, Yassine Namir, Nikola S. Nikolov, Jamal Kharroubi

Accurate racism classification is crucial on social media, where racist and discriminatory content can harm individuals and society. Automated racism detection requires gathering and annotating a wide range of diverse and representative data as an essential source of information for the system. However, this task is highly demanding in both time and resources, making it a costly process. Moreover, racism can appear differently across languages because of the distinct cultural subtleties and vocabularies linked to each language. Effective racism detection therefore requires information resources in native languages, which further complicates constructing a database explicitly designed for identifying racism on social media platforms. In this study, an automated data annotation system for racism classification is presented, combining self-training with a Sentence-BERT (SBERT) transformer-based model for data representation and a Convolutional Neural Network (CNN) model. The system aids in the creation of a multilingual racism dataset consisting of 26,866 instances gathered from Facebook and Twitter, achieved through a self-training process that uses a labeled subset of the dataset to annotate the remaining unlabeled data. The study examines the impact of self-training on the system’s performance, revealing significant enhancements in model effectiveness. On the English dataset, the system achieves a noteworthy accuracy of 92.53% and an F-score of 88.26%; on the French dataset, an accuracy of 93.64% and an F-score of 92.68%; and on the Arabic dataset, an accuracy of 91.03% and an F-score of 92.15%. As demonstrated in this study, self-training yields a remarkable 8–12% improvement in accuracy and F-score.

Citations: 0
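The paper's pipeline (SBERT features, a CNN classifier, Facebook/Twitter data) is not reproduced here. As a minimal, generic sketch of the self-training loop the abstract describes — pseudo-label high-confidence unlabeled examples, add them to the labeled set, refit — the following uses a toy nearest-centroid classifier in place of the CNN; all names and the confidence heuristic are illustrative:

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, X):
    """Predict labels and a softmax-over-negative-distance confidence."""
    classes = sorted(model)
    d = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes])
    conf = np.exp(-d) / np.exp(-d).sum(axis=0)
    idx = d.argmin(axis=0)
    return np.array([classes[i] for i in idx]), conf.max(axis=0)

def self_train(X_lab, y_lab, X_unlab, threshold=0.6, rounds=5):
    """Self-training: repeatedly pseudo-label unlabeled points whose
    confidence clears the threshold and absorb them into the labeled set."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    mask = np.ones(len(X_unlab), dtype=bool)   # still-unlabeled points
    for _ in range(rounds):
        if not mask.any():
            break
        model = nearest_centroid_fit(X_lab, y_lab)
        pred, conf = nearest_centroid_predict(model, X_unlab[mask])
        keep = conf >= threshold
        if not keep.any():
            break
        idx = np.flatnonzero(mask)[keep]
        X_lab = np.vstack([X_lab, X_unlab[idx]])
        y_lab = np.concatenate([y_lab, pred[keep]])
        mask[idx] = False
    return nearest_centroid_fit(X_lab, y_lab)
```

The confidence threshold is the main knob: set too low, noisy pseudo-labels pollute the training set; set too high, the loop stalls with unlabeled data left over.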