Pub Date : 2024-06-04 DOI: 10.1007/s10618-024-01039-6
Modeling the impact of out-of-schema questions in task-oriented dialog systems
Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis
Existing work on task-oriented dialog systems generally assumes that users’ interaction with the system is restricted to the information stored in a closed data schema. In practice, however, users may ask ‘out-of-schema’ questions, that is, questions the system cannot answer because the information does not exist in the schema. Failure to answer these questions may lead users to drop out of the chat before reaching the success state (e.g., reserving a restaurant). A key challenge is that the number of such questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection, which identifies the most critical out-of-schema questions to answer in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and picks a question based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets contain neither out-of-schema questions nor user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how effectively our methods identify out-of-schema questions and improve the success rate of the system.
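For intuition, here is a minimal sketch of the probabilistic selection idea: conversations as an absorbing Markov chain whose drop-out mass is partly redirected once a question is answered. All state names, transition probabilities, and the redirection are illustrative assumptions, not the paper's actual model or benefit estimator.

```python
import numpy as np

STATES = ["browse", "oos_q", "success", "drop"]  # success/drop are absorbing
TRANSIENT, ABSORBING = [0, 1], [2, 3]

def success_prob(P, start=0):
    # Standard absorbing-chain computation: B = (I - Q)^-1 R, where Q holds
    # transient-to-transient and R transient-to-absorbing probabilities.
    Q = P[np.ix_(TRANSIENT, TRANSIENT)]
    R = P[np.ix_(TRANSIENT, ABSORBING)]
    B = np.linalg.solve(np.eye(len(TRANSIENT)) - Q, R)
    return B[start, 0]  # probability of being absorbed in "success"

# Unanswered out-of-schema question: most users drop out after asking it.
P_before = np.array([
    [0.10, 0.30, 0.50, 0.10],  # browse
    [0.20, 0.00, 0.00, 0.80],  # oos_q, unanswered
    [0.00, 0.00, 1.00, 0.00],  # success
    [0.00, 0.00, 0.00, 1.00],  # drop
])
# If the question were answered, most users would continue instead.
P_after = P_before.copy()
P_after[1] = [0.70, 0.00, 0.20, 0.10]

benefit = success_prob(P_after) - success_prob(P_before)
print(f"expected gain in success rate: {benefit:.3f}")  # ~0.217 in this toy
```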
{"title":"Modeling the impact of out-of-schema questions in task-oriented dialog systems","authors":"Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis","doi":"10.1007/s10618-024-01039-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01039-6","url":null,"abstract":"<p>Existing work on task-oriented dialog systems generally assumes that the interaction of users with the system is restricted to the information stored in a closed data schema. However, in practice users may ask ‘out-of-schema’ questions, that is, questions that the system cannot answer, because the information does not exist in the schema. Failure to answer these questions may lead the users to drop out of the chat before reaching the success state (e.g. reserving a restaurant). A key challenge is that the number of these questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection that identifies the most critical out-of-schema questions to answer, in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and a question is picked based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets do not contain out-of-schema questions or user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how our methods can effectively identify out-of-schema questions and positively impact the success rate of the system.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"43 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-02 DOI: 10.1007/s10618-024-01038-7
Improving graph-based recommendation with unraveled graph learning
Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen
Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. The integration of contrastive learning has further enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is neither graph augmentation nor noise perturbation, but rather the balance of the embeddings from each layer in the output embedding. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named Unraveled Graph Contrastive Learning (UGCL), which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding to improve performance. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all state-of-the-art baseline models, while also showing superior fairness and debiasing capabilities compared to the baselines.
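As a rough illustration of the balance factor (not the authors' UGCL propagation scheme), here is a LightGCN-style sketch in which the output embedding is a weighted sum of per-layer embeddings; the weights alpha are the "balance" and all shapes and values are illustrative.

```python
import numpy as np

def propagate(A_hat, E0, alpha):
    """A_hat: normalized adjacency; E0: initial embeddings; alpha: per-layer
    weights that set the balance of each layer in the output embedding."""
    E, out = E0, alpha[0] * E0
    for a in alpha[1:]:
        E = A_hat @ E        # one round of neighborhood aggregation
        out = out + a * E    # this layer's weighted contribution to the output
    return out

rng = np.random.default_rng(0)
n, d = 6, 8
A_hat = rng.random((n, n))
A_hat /= A_hat.sum(axis=1, keepdims=True)            # row-normalized, toy graph
E0 = rng.standard_normal((n, d))
uniform = propagate(A_hat, E0, [0.25] * 4)           # LightGCN-style uniform balance
skewed = propagate(A_hat, E0, [0.1, 0.2, 0.3, 0.4])  # a differently balanced output
```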
{"title":"Improving graph-based recommendation with unraveled graph learning","authors":"Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen","doi":"10.1007/s10618-024-01038-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01038-7","url":null,"abstract":"<p>Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. Furthermore, the integration of contrastive learning has enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is not graph augmentation or noise perturbation, but rather the <i>balance of the embedding from each layer in the output embedding</i>. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named <i>Unraveled Graph Contrastive Learning (UGCL)</i>, which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding for performance improvement. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all other state-of-the-art baseline models, also showing superior performance in terms of fairness and debiasing capabilities compared to other baselines.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"30 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141259415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01025-y
A practical approach to novel class discovery in tabular data
Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine
The problem of novel class discovery (NCD) consists of extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model composed of only the essential elements necessary for the NCD problem, which shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments conducted on seven tabular datasets demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
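A hedged sketch of the hidden-class tuning loop: in each fold, some known classes are treated as if they were novel and a candidate hyperparameter setting is scored by how well those classes are recovered. Here k-means stands in for the NCD model, and the dataset and scoring choices are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def score_hyperparams(X, y, n_folds=4, hidden_per_fold=2, seed=0):
    """Score one hyperparameter setting by how well hidden known classes,
    treated as novel, are clustered; k-means stands in for the NCD model."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    scores = []
    for _ in range(n_folds):
        hidden = rng.choice(classes, size=hidden_per_fold, replace=False)
        mask = np.isin(y, hidden)
        # The real pipeline trains the NCD model on the classes that stay
        # labeled and asks it to partition the hidden ones.
        pred = KMeans(n_clusters=hidden_per_fold, n_init=10,
                      random_state=seed).fit_predict(X[mask])
        scores.append(adjusted_rand_score(y[mask], pred))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X, y = rng.standard_normal((200, 5)), rng.integers(0, 6, 200)
print(score_hyperparams(X, y))
```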
{"title":"A practical approach to novel class discovery in tabular data","authors":"Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine","doi":"10.1007/s10618-024-01025-y","DOIUrl":"https://doi.org/10.1007/s10618-024-01025-y","url":null,"abstract":"<p>The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the <i>k</i>-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (<i>k</i>-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"123 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01024-z
Bias-aware ranking from pairwise comparisons
Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner
Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes the human feedback as input does not control for potential biases, the result may be biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of ranking people (or items) from pairwise comparisons. Specifically, we assume we are given pairwise comparisons of the items to be ranked, produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect, to a certain extent, the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor of, some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce with access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model with a bias parameter for each evaluator, which distorts the true quality score of each item depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write its log-likelihood explicitly w.r.t. the parameters (i.e., items’ latent scores and evaluators’ biases) and optimize it by means of an alternating approach. Our experiments on synthetic and real-world data confirm that our method reconstructs the bias of each single evaluator extremely well and thus outperforms several non-trivial competitors in the task of producing a ranking that is as close as possible to the unbiased ranking.
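One plausible way to write such an extended model (the paper's exact parameterization may differ): evaluator e perceives item i's latent score shifted by a per-evaluator bias whenever i belongs to a group G, and the usual Bradley-Terry likelihood is then maximized by alternating updates.

```latex
% A plausible parameterization, not necessarily the paper's exact one:
% evaluator e perceives item i's latent score s_i shifted by a bias beta_e
% whenever i belongs to the group G that e is biased about.
\[
\tilde{s}_i^{(e)} = s_i + \beta_e \,\mathbb{1}[i \in G], \qquad
\Pr(i \succ_e j) =
  \frac{\exp\big(\tilde{s}_i^{(e)}\big)}
       {\exp\big(\tilde{s}_i^{(e)}\big) + \exp\big(\tilde{s}_j^{(e)}\big)}
\]
% The log-likelihood over all observed comparisons,
\[
\log \mathcal{L}(s, \beta) = \sum_{(i \succ_e j)} \log \Pr(i \succ_e j),
\]
% is maximized by alternating between updates of the item scores s
% and the evaluator biases beta.
```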
{"title":"Bias-aware ranking from pairwise comparisons","authors":"Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner","doi":"10.1007/s10618-024-01024-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01024-z","url":null,"abstract":"<p>Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes as input the human feedback does not control for potential biases, this might result in biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of <i>ranking people (or items) from pairwise comparisons</i>. Specifically, we assume we are given pairwise comparisons of the items to be ranked produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect to a certain extent the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor, of some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce by having access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model by having a bias parameter for each evaluator which distorts the true quality score of each item, depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write explicitly its log-likelihood w.r.t. the parameters (i.e., items’ latent scores and evaluators’ bias) and optimize by means of the alternating approach. Our experiments on synthetic and real-world data confirm that our method is able to reconstruct the bias of each single evaluator extremely well and thus to outperform several non-trivial competitors in the task of producing a ranking which is as much as possible close to the unbiased ranking.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"5 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-30 DOI: 10.1007/s10618-024-01032-z
LoCoMotif: discovering time-warped motifs in time series
Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
Time series motif discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.
{"title":"LoCoMotif: discovering time-warped motifs in time series","authors":"Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel","doi":"10.1007/s10618-024-01032-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01032-z","url":null,"abstract":"<p>Time series motif discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01031-0
On the impact of multi-dimensional local differential privacy on fairness
Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi
Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper we examine the impact on fairness of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data). Detailed empirical analysis on synthetic and benchmark datasets revealed several notable observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional LDP approach (we employ two variants) matters only at low privacy guarantees (high ε), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Lastly, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.
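The abstract does not spell out the two multi-dimensional variants it compares; the sketch below shows two options commonly used in the LDP literature, both built on generalized randomized response (GRR): splitting the privacy budget across attributes versus spending it all on one sampled attribute. Function names and domains are illustrative.

```python
import math, random

def grr(value, domain, epsilon):
    """Generalized randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise report a uniformly random other value."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return value
    return random.choice([v for v in domain if v != value])

def spl(record, domains, epsilon):
    # Variant 1: split the budget, giving each of the d attributes eps / d.
    eps_each = epsilon / len(record)
    return [grr(v, dom, eps_each) for v, dom in zip(record, domains)]

def smp(record, domains, epsilon):
    # Variant 2: sample one attribute and spend the whole budget on it;
    # only that attribute's (index, value) pair is reported.
    j = random.randrange(len(record))
    return j, grr(record[j], domains[j], epsilon)

domains = [["F", "M"], ["low", "mid", "high"]]
print(spl(["F", "mid"], domains, epsilon=1.0))
print(smp(["F", "mid"], domains, epsilon=1.0))
```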
{"title":"On the impact of multi-dimensional local differential privacy on fairness","authors":"Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi","doi":"10.1007/s10618-024-01031-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01031-0","url":null,"abstract":"<p>Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., <i>multi-dimensional data</i>) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional approach of LDP (we employ two variants) matters only at low privacy guarantees (high <span>(epsilon)</span>), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01030-1
Effective interpretable learning for large-scale categorical data
Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li
Large-scale categorical datasets are ubiquitous in machine learning, and the success of most deployed machine learning models relies on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (aka. wide learning). The second strategy focuses on the depth of a network, e.g., Artificial Neural Networks or ANN (aka. deep learning). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (aka. factorized learning). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, combining the various strategies leads to excellent results; for example, WD-Learning, xdeepFM, etc., lead to state-of-the-art results. Following this trend, we propose another learning framework, WBDF-Learning, based on the combination of wide, deep, and factorized learning and a newly introduced component named Broad Interaction Network (BIN). BIN takes the form of a Bayesian network classifier whose structure is learned a priori and whose parameters are learned by optimizing a joint objective function along with the wide, deep, and factorized parts. We denote the learning of BIN parameters as broad learning. Additionally, the parameters of BIN are constrained to be actual probabilities, which makes the model highly interpretable. Furthermore, one can sample or generate data from BIN, which can facilitate learning and provides a framework for knowledge-guided machine learning. We demonstrate that our proposed framework is resilient enough to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that the WBDF framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability, and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.
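A structural sketch of how the four components might be combined into one logit, not the paper's implementation: the BIN part is stood in for by a naive-Bayes-style per-feature log-odds (the real BIN is a Bayesian network learned a priori), and all shapes and parameters are illustrative.

```python
import numpy as np

def wide(x, w, b):      # generalized linear (wide) part
    return x @ w + b

def fm(x, V):           # factorized pairwise interactions (FM part)
    return 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))

def deep(x, W1, W2):    # a tiny one-hidden-layer MLP stands in for the deep part
    return np.maximum(x @ W1, 0.0) @ W2

def bin_logodds(x, log_odds):
    # Stand-in for BIN: per-feature log-odds, naive-Bayes style.
    return float(x @ log_odds)

def wbdf_logit(x, params):
    return (wide(x, *params["wide"]) + fm(x, params["fm"])
            + deep(x, *params["deep"]) + bin_logodds(x, params["bin"]))

rng = np.random.default_rng(0)
d, h, k = 10, 8, 4
x = rng.integers(0, 2, size=d).astype(float)  # one-hot-encoded categorical input
params = {"wide": (rng.standard_normal(d), 0.0),
          "fm": rng.standard_normal((d, k)),
          "deep": (rng.standard_normal((d, h)), rng.standard_normal(h)),
          "bin": rng.standard_normal(d)}
print(wbdf_logit(x, params))
```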
{"title":"Effective interpretable learning for large-scale categorical data","authors":"Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li","doi":"10.1007/s10618-024-01030-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01030-1","url":null,"abstract":"<p>Large scale categorical datasets are ubiquitous in machine learning and the success of most deployed machine learning models rely on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (aka. <span>wide learning</span>). The second strategy focuses on the depth of a network, e.g., Artificial Neural networks or <span>ANN</span> (aka. <span>deep learning</span>). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (aka. <span>factorized learning</span>). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, combination of various strategies leads to excellent results. For example, <span>WD</span>-Learning, <span>xdeepFM</span>, etc., leads to state-of-the-art results. Following the trend, in this work, we have proposed another learning framework—<span>WBDF</span>-Learning, based on the combination of <span>wide</span>, <span>deep</span>, <span>factorization</span>, and a newly introduced component named <span>Broad Interaction network</span> (<span>BIN</span>). <span>BIN</span> is in the form of a Bayesian network classifier whose structure is learned apriori, and parameters are learned by optimizing a joint objective function along with <span>wide</span>, <span>deep</span> and <span>factorized</span> parts. We denote the learning of <span>BIN</span> parameters as <span>broad learning</span>. Additionally, the parameters of <span>BIN</span> are constrained to be actual probabilities—therefore, it is extremely interpretable. Furthermore, one can sample or generate data from <span>BIN</span>, which can facilitate learning and provides a framework for <i>knowledge-guided machine learning</i>. We demonstrate that our proposed framework possesses the resilience to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that, <span>WBDF</span> framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"22 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-26 DOI: 10.1007/s10618-024-01037-8
WaveLSea: helping experts interactively explore pattern mining search spaces
Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali
This article presents the method Wave Top-k Random-d Lineage Search (WaveLSea), which guides an expert through data mining results according to her interest. The method exploits expert feedback, combined with the relations between patterns, to spread the expert’s interest. It avoids the typical feature definition step commonly used in interactive data mining, which limits the flexibility of the discovery process. We empirically demonstrate that WaveLSea returns the most relevant results for the user’s subjective interest. Even with imperfect feedback, WaveLSea remains robust, still delivering mostly interesting results in experiments on graph-structured data. To assess the robustness of the method, we design novel oracles, called soothsayers, that give imperfect feedback. Finally, we complement our quantitative study with a qualitative study that evaluates WaveLSea through a user interface.
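A toy sketch of the feedback-spreading intuition, not the WaveLSea algorithm itself: patterns form a graph whose edges encode a relation such as lineage, and a rating on one pattern nudges the interest of nearby patterns with per-hop decay. The graph, scores, and decay are illustrative assumptions.

```python
from collections import defaultdict

def spread_feedback(edges, rated_pattern, delta, decay=0.5, depth=2):
    """Return interest updates: `delta` at the rated pattern, attenuated by
    `decay` per hop along pattern relations."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    updates, frontier = {rated_pattern: delta}, {rated_pattern}
    for _ in range(depth):
        delta *= decay
        frontier = {w for u in frontier for w in neighbors[u]} - updates.keys()
        updates.update({w: delta for w in frontier})
    return updates

edges = [("A", "AB"), ("AB", "ABC"), ("A", "AC")]  # a tiny pattern lineage
print(spread_feedback(edges, "AB", delta=1.0))     # expert liked pattern AB
```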
{"title":"WaveLSea: helping experts interactively explore pattern mining search spaces","authors":"Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali","doi":"10.1007/s10618-024-01037-8","DOIUrl":"https://doi.org/10.1007/s10618-024-01037-8","url":null,"abstract":"<p>This article presents the method Wave Top-k Random-d Lineage Search (WaveLSea) which guides an expert through data mining results according to her interest. The method exploits expert feedback, combined with the relation between patterns to spread the expert’s interest. It avoids the typical feature definition step commonly used in interactive data mining which limits the flexibility of the discovery process. We empirically demonstrate that WaveLSea returns the most relevant results for the user’s subjective interest. Even with imperfect feedback, WaveLSea behavior remains robust as it primarily still delivers most interesting results during experiments on graph-structured data. In order to assess the robustness of the method we design novel oracles called soothsayers giving imperfect feedback. Finally, we complete our quantitative study with a qualitative study using a user interface to evaluate WaveLSea.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"98 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-25 DOI: 10.1007/s10618-024-01026-x
Active learning with biased non-response to label requests
Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy
Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy, the Upper Confidence Bound of the Expected Utility (UCB-EU), which can plausibly be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data-generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements for conversion models trained on clicked impressions. Most generally, this research serves both to better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.
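A sketch of the UCB-EU scoring idea under stated assumptions: a candidate's informativeness (expected utility) is discounted by an optimistic upper confidence bound on the probability that its label request will be answered, so rarely queried segments are still explored. The utility values, response statistics, and exploration constant are placeholders.

```python
import math

def ucb_response_rate(answered, requested, total_requests, c=1.0):
    """Optimistic estimate of a segment's label-response rate."""
    if requested == 0:
        return 1.0  # optimistic prior so unexplored segments get tried
    mean = answered / requested
    return min(1.0, mean + c * math.sqrt(math.log(total_requests) / requested))

def ucb_eu(expected_utility, answered, requested, total_requests):
    # Informativeness discounted by the (optimistic) chance of a response.
    return expected_utility * ucb_response_rate(answered, requested, total_requests)

# Candidates as (expected_utility, answered, requested) per segment.
candidates = [(0.9, 1, 10), (0.6, 8, 10), (0.7, 0, 0)]
total = sum(r for _, _, r in candidates)
best = max(candidates, key=lambda c: ucb_eu(c[0], c[1], c[2], total))
print(best)
```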
{"title":"Active learning with biased non-response to label requests","authors":"Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy","doi":"10.1007/s10618-024-01026-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01026-x","url":null,"abstract":"<p>Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy–the <i>Upper Confidence Bound of the Expected Utility (UCB-EU)</i>–that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"36 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-22 DOI: 10.1007/s10618-024-01036-9
quant: a minimalist interval method for time series classification
Angus Dempster, Daniel F. Schmidt, Geoffrey I. Webb
We show that it is possible to achieve the same accuracy, on average, as the most accurate existing interval methods for time series classification on a standard set of benchmark datasets using a single type of feature (quantiles), fixed intervals, and an ‘off-the-shelf’ classifier. This distillation of interval-based approaches represents a fast and accurate method for time series classification, achieving state-of-the-art accuracy on the expanded set of 142 datasets in the UCR archive with a total compute time (training and inference) of less than 15 minutes using a single CPU core.
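A minimal sketch of the recipe: fixed quantiles over fixed dyadic intervals of each series, fed to an off-the-shelf classifier. The interval depths, quantile count, and choice of classifier here are illustrative, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def quantile_features(X, depths=(0, 1, 2), n_quantiles=4):
    """X: (n_series, length). At each depth d, split every series into 2^d
    equal intervals and take evenly spaced quantiles of each interval."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    feats = []
    for d in depths:
        for part in np.array_split(X, 2 ** d, axis=1):
            feats.append(np.quantile(part, qs, axis=1).T)  # (n_series, q)
    return np.hstack(feats)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 64)), rng.integers(0, 2, 100)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(quantile_features(X), y)  # the classifier itself is entirely standard
```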
{"title":"quant: a minimalist interval method for time series classification","authors":"Angus Dempster, Daniel F. Schmidt, Geoffrey I. Webb","doi":"10.1007/s10618-024-01036-9","DOIUrl":"https://doi.org/10.1007/s10618-024-01036-9","url":null,"abstract":"<p>We show that it is possible to achieve the same accuracy, on average, as the most accurate existing interval methods for time series classification on a standard set of benchmark datasets using a single type of feature (quantiles), fixed intervals, and an ‘off the shelf’ classifier. This distillation of interval-based approaches represents a fast and accurate method for time series classification, achieving state-of-the-art accuracy on the expanded set of 142 datasets in the UCR archive with a total compute time (training and inference) of less than 15 min using a single CPU core.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}