Pub Date : 2024-06-04 DOI: 10.1007/s10618-024-01039-6
Modeling the impact of out-of-schema questions in task-oriented dialog systems
Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis
Existing work on task-oriented dialog systems generally assumes that users’ interaction with the system is restricted to the information stored in a closed data schema. In practice, however, users may ask ‘out-of-schema’ questions, that is, questions the system cannot answer because the information does not exist in the schema. Failure to answer these questions may lead users to drop out of the chat before reaching the success state (e.g., reserving a restaurant). A key challenge is that the number of such questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection, which identifies the most critical out-of-schema questions to answer in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and picks a question based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets contain neither out-of-schema questions nor user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how effectively our methods identify out-of-schema questions and improve the success rate of the system.
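For intuition, here is a minimal sketch of the probabilistic selection idea: conversations as an absorbing Markov chain whose drop-out mass is partly redirected once a question is answered. All state names, transition probabilities, and the redirection are illustrative assumptions, not the paper's actual model or benefit estimator.

```python
import numpy as np

STATES = ["browse", "oos_q", "success", "drop"]  # success/drop are absorbing
TRANSIENT, ABSORBING = [0, 1], [2, 3]

def success_prob(P, start=0):
    # Standard absorbing-chain computation: B = (I - Q)^-1 R, where Q holds
    # transient-to-transient and R transient-to-absorbing probabilities.
    Q = P[np.ix_(TRANSIENT, TRANSIENT)]
    R = P[np.ix_(TRANSIENT, ABSORBING)]
    B = np.linalg.solve(np.eye(len(TRANSIENT)) - Q, R)
    return B[start, 0]  # probability of being absorbed in "success"

# Unanswered out-of-schema question: most users drop out after asking it.
P_before = np.array([
    [0.10, 0.30, 0.50, 0.10],  # browse
    [0.20, 0.00, 0.00, 0.80],  # oos_q, unanswered
    [0.00, 0.00, 1.00, 0.00],  # success
    [0.00, 0.00, 0.00, 1.00],  # drop
])
# If the question were answered, most users would continue instead.
P_after = P_before.copy()
P_after[1] = [0.70, 0.00, 0.20, 0.10]

benefit = success_prob(P_after) - success_prob(P_before)
print(f"expected gain in success rate: {benefit:.3f}")  # ~0.217 in this toy
```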
{"title":"Modeling the impact of out-of-schema questions in task-oriented dialog systems","authors":"Jannat Ara Meem, Muhammad Shihab Rashid, Vagelis Hristidis","doi":"10.1007/s10618-024-01039-6","DOIUrl":"https://doi.org/10.1007/s10618-024-01039-6","url":null,"abstract":"<p>Existing work on task-oriented dialog systems generally assumes that the interaction of users with the system is restricted to the information stored in a closed data schema. However, in practice users may ask ‘out-of-schema’ questions, that is, questions that the system cannot answer, because the information does not exist in the schema. Failure to answer these questions may lead the users to drop out of the chat before reaching the success state (e.g. reserving a restaurant). A key challenge is that the number of these questions may be too high for a domain expert to answer them all. We formulate the problem of out-of-schema question detection and selection that identifies the most critical out-of-schema questions to answer, in order to maximize the expected success rate of the system. We propose a two-stage pipeline to solve the problem. In the first stage, we propose a novel in-context learning (ICL) approach to detect out-of-schema questions. In the second stage, we propose two algorithms for out-of-schema question selection (OQS): a naive approach that chooses a question based on its frequency in the dropped-out conversations, and a probabilistic approach that represents each conversation as a Markov chain and a question is picked based on its overall benefit. We propose and publish two new datasets for the problem, as existing datasets do not contain out-of-schema questions or user drop-outs. Our quantitative and simulation-based experimental analyses on these datasets measure how our methods can effectively identify out-of-schema questions and positively impact the success rate of the system.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"43 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141258149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-02 DOI: 10.1007/s10618-024-01038-7
Improving graph-based recommendation with unraveled graph learning
Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen
Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. The integration of contrastive learning has further enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is neither graph augmentation nor noise perturbation, but rather the balance of the embeddings from each layer in the output embedding. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named Unraveled Graph Contrastive Learning (UGCL), which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding to improve performance. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all state-of-the-art baseline models, while also showing superior fairness and debiasing capabilities compared to the baselines.
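As a rough illustration of the balance factor (not the authors' UGCL propagation scheme), here is a LightGCN-style sketch in which the output embedding is a weighted sum of per-layer embeddings; the weights alpha are the "balance" and all shapes and values are illustrative.

```python
import numpy as np

def propagate(A_hat, E0, alpha):
    """A_hat: normalized adjacency; E0: initial embeddings; alpha: per-layer
    weights that set the balance of each layer in the output embedding."""
    E, out = E0, alpha[0] * E0
    for a in alpha[1:]:
        E = A_hat @ E        # one round of neighborhood aggregation
        out = out + a * E    # this layer's weighted contribution to the output
    return out

rng = np.random.default_rng(0)
n, d = 6, 8
A_hat = rng.random((n, n))
A_hat /= A_hat.sum(axis=1, keepdims=True)            # row-normalized, toy graph
E0 = rng.standard_normal((n, d))
uniform = propagate(A_hat, E0, [0.25] * 4)           # LightGCN-style uniform balance
skewed = propagate(A_hat, E0, [0.1, 0.2, 0.3, 0.4])  # a differently balanced output
```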
{"title":"Improving graph-based recommendation with unraveled graph learning","authors":"Chih-Chieh Chang, Diing-Ruey Tzeng, Chia-Hsun Lu, Ming-Yi Chang, Chih-Ya Shen","doi":"10.1007/s10618-024-01038-7","DOIUrl":"https://doi.org/10.1007/s10618-024-01038-7","url":null,"abstract":"<p>Graph Collaborative Filtering (GraphCF) has emerged as a promising approach in recommendation systems, leveraging the inferential power of Graph Neural Networks. Furthermore, the integration of contrastive learning has enhanced the performance of GraphCF methods. Recent research has shifted from graph augmentation to noise perturbation in contrastive learning, leading to significant performance improvements. However, we contend that the primary factor in performance enhancement is not graph augmentation or noise perturbation, but rather the <i>balance of the embedding from each layer in the output embedding</i>. To substantiate our claim, we conducted preliminary experiments with multiple state-of-the-art GraphCF methods. Based on our observations and insights, we propose a novel approach named <i>Unraveled Graph Contrastive Learning (UGCL)</i>, which includes a new propagation scheme to further enhance performance. To the best of our knowledge, this is the first approach that specifically addresses the balance factor in the output embedding for performance improvement. We have carried out extensive experiments on multiple large-scale benchmark datasets to evaluate the effectiveness of our proposed approach. The results indicate that UGCL significantly outperforms all other state-of-the-art baseline models, also showing superior performance in terms of fairness and debiasing capabilities compared to other baselines.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"30 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141259415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01025-y
A practical approach to novel class discovery in tabular data
Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine
The problem of novel class discovery (NCD) consists of extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the k-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model composed of only the essential elements necessary for the NCD problem, which shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (k-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments conducted on seven tabular datasets demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.
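A hedged sketch of the hidden-class tuning loop: in each fold, some known classes are treated as if they were novel and a candidate hyperparameter setting is scored by how well those classes are recovered. Here k-means stands in for the NCD model, and the dataset and scoring choices are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def score_hyperparams(X, y, n_folds=4, hidden_per_fold=2, seed=0):
    """Score one hyperparameter setting by how well hidden known classes,
    treated as novel, are clustered; k-means stands in for the NCD model."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    scores = []
    for _ in range(n_folds):
        hidden = rng.choice(classes, size=hidden_per_fold, replace=False)
        mask = np.isin(y, hidden)
        # The real pipeline trains the NCD model on the classes that stay
        # labeled and asks it to partition the hidden ones.
        pred = KMeans(n_clusters=hidden_per_fold, n_init=10,
                      random_state=seed).fit_predict(X[mask])
        scores.append(adjusted_rand_score(y[mask], pred))
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X, y = rng.standard_normal((200, 5)), rng.integers(0, 6, 200)
print(score_hyperparams(X, y))
```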
{"title":"A practical approach to novel class discovery in tabular data","authors":"Troisemaine Colin, Reiffers-Masson Alexandre, Gosselin Stéphane, Lemaire Vincent, Vaton Sandrine","doi":"10.1007/s10618-024-01025-y","DOIUrl":"https://doi.org/10.1007/s10618-024-01025-y","url":null,"abstract":"<p>The problem of novel class discovery (NCD) consists in extracting knowledge from a labeled set of known classes to accurately partition an unlabeled set of novel classes. While NCD has recently received a lot of attention from the community, it is often solved on computer vision problems and under unrealistic conditions. In particular, the number of novel classes is usually assumed to be known in advance, and their labels are sometimes used to tune hyperparameters. Methods that rely on these assumptions are not applicable in real-world scenarios. In this work, we focus on solving NCD in tabular data when no prior knowledge of the novel classes is available. To this end, we propose to tune the hyperparameters of NCD methods by adapting the <i>k</i>-fold cross-validation process and hiding some of the known classes in each fold. Since we have found that methods with too many hyperparameters are likely to overfit these hidden classes, we define a simple deep NCD model. This method is composed of only the essential elements necessary for the NCD problem and shows robust performance under realistic conditions. Furthermore, we find that the latent space of this method can be used to reliably estimate the number of novel classes. Additionally, we adapt two unsupervised clustering algorithms (<i>k</i>-means and Spectral Clustering) to leverage the knowledge of the known classes. Extensive experiments are conducted on 7 tabular datasets and demonstrate the effectiveness of the proposed method and hyperparameter tuning process, and show that the NCD problem can be solved without relying on knowledge from the novel classes.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"123 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-31 DOI: 10.1007/s10618-024-01024-z
Bias-aware ranking from pairwise comparisons
Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner
Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes the human feedback as input does not control for potential biases, the result may be biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of ranking people (or items) from pairwise comparisons. Specifically, we assume we are given pairwise comparisons of the items to be ranked, produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect, to a certain extent, the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor of, some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce with access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model with a bias parameter for each evaluator, which distorts the true quality score of each item depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write its log-likelihood explicitly w.r.t. the parameters (i.e., items’ latent scores and evaluators’ biases) and optimize it by means of an alternating approach. Our experiments on synthetic and real-world data confirm that our method reconstructs the bias of each single evaluator extremely well and thus outperforms several non-trivial competitors in the task of producing a ranking that is as close as possible to the unbiased ranking.
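One plausible way to write such an extended model (the paper's exact parameterization may differ): evaluator e perceives item i's latent score shifted by a per-evaluator bias whenever i belongs to a group G, and the usual Bradley-Terry likelihood is then maximized by alternating updates.

```latex
% A plausible parameterization, not necessarily the paper's exact one:
% evaluator e perceives item i's latent score s_i shifted by a bias beta_e
% whenever i belongs to the group G that e is biased about.
\[
\tilde{s}_i^{(e)} = s_i + \beta_e \,\mathbb{1}[i \in G], \qquad
\Pr(i \succ_e j) =
  \frac{\exp\big(\tilde{s}_i^{(e)}\big)}
       {\exp\big(\tilde{s}_i^{(e)}\big) + \exp\big(\tilde{s}_j^{(e)}\big)}
\]
% The log-likelihood over all observed comparisons,
\[
\log \mathcal{L}(s, \beta) = \sum_{(i \succ_e j)} \log \Pr(i \succ_e j),
\]
% is maximized by alternating between updates of the item scores s
% and the evaluator biases beta.
```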
{"title":"Bias-aware ranking from pairwise comparisons","authors":"Antonio Ferrara, Francesco Bonchi, Francesco Fabbri, Fariba Karimi, Claudia Wagner","doi":"10.1007/s10618-024-01024-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01024-z","url":null,"abstract":"<p>Human feedback is often used, either directly or indirectly, as input to algorithmic decision making. However, humans are biased: if the algorithm that takes as input the human feedback does not control for potential biases, this might result in biased algorithmic decision making, which can have a tangible impact on people’s lives. In this paper, we study how to detect and correct for evaluators’ bias in the task of <i>ranking people (or items) from pairwise comparisons</i>. Specifically, we assume we are given pairwise comparisons of the items to be ranked produced by a set of evaluators. While the pairwise assessments of the evaluators should reflect to a certain extent the latent (unobservable) true quality scores of the items, they might be affected by each evaluator’s own bias against, or in favor, of some groups of items. By detecting and amending evaluators’ biases, we aim to produce a ranking of the items that is, as much as possible, in accordance with the ranking one would produce by having access to the latent quality scores. Our proposal is a novel method that extends the classic Bradley-Terry model by having a bias parameter for each evaluator which distorts the true quality score of each item, depending on the group the item belongs to. Thanks to the simplicity of the model, we are able to write explicitly its log-likelihood w.r.t. the parameters (i.e., items’ latent scores and evaluators’ bias) and optimize by means of the alternating approach. Our experiments on synthetic and real-world data confirm that our method is able to reconstruct the bias of each single evaluator extremely well and thus to outperform several non-trivial competitors in the task of producing a ranking which is as much as possible close to the unbiased ranking.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"5 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-30 DOI: 10.1007/s10618-024-01032-z
LoCoMotif: discovering time-warped motifs in time series
Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
Time series motif discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.
{"title":"LoCoMotif: discovering time-warped motifs in time series","authors":"Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel","doi":"10.1007/s10618-024-01032-z","DOIUrl":"https://doi.org/10.1007/s10618-024-01032-z","url":null,"abstract":"<p>Time series motif discovery (TSMD) refers to the task of identifying patterns that occur multiple times (possibly with minor variations) in a time series. All existing methods for TSMD have one or more of the following limitations: they only look for the two most similar occurrences of a pattern; they only look for patterns of a pre-specified, fixed length; they cannot handle variability along the time axis; and they only handle univariate time series. In this paper, we present a new method, LoCoMotif, that has none of these limitations. The method is motivated by a concrete use case from physiotherapy. We demonstrate the value of the proposed method on this use case. We also introduce a new quantitative evaluation metric for motif discovery, and benchmark data for comparing TSMD methods. LoCoMotif substantially outperforms the existing methods, on top of being more broadly applicable.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"44 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141197846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01031-0
On the impact of multi-dimensional local differential privacy on fairness
Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi
Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper we examine the impact on fairness of LDP in the presence of several sensitive attributes (i.e., multi-dimensional data). Detailed empirical analysis on synthetic and benchmark datasets revealed several notable observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional LDP approach (we employ two variants) matters only at low privacy guarantees (high ε), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Lastly, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.
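The abstract does not spell out the two multi-dimensional variants it compares; the sketch below shows two options commonly used in the LDP literature, both built on generalized randomized response (GRR): splitting the privacy budget across attributes versus spending it all on one sampled attribute. Function names and domains are illustrative.

```python
import math, random

def grr(value, domain, epsilon):
    """Generalized randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise report a uniformly random other value."""
    k = len(domain)
    p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if random.random() < p_true:
        return value
    return random.choice([v for v in domain if v != value])

def spl(record, domains, epsilon):
    # Variant 1: split the budget, giving each of the d attributes eps / d.
    eps_each = epsilon / len(record)
    return [grr(v, dom, eps_each) for v, dom in zip(record, domains)]

def smp(record, domains, epsilon):
    # Variant 2: sample one attribute and spend the whole budget on it;
    # only that attribute's (index, value) pair is reported.
    j = random.randrange(len(record))
    return j, grr(record[j], domains[j], epsilon)

domains = [["F", "M"], ["low", "mid", "high"]]
print(spl(["F", "mid"], domains, epsilon=1.0))
print(smp(["F", "mid"], domains, epsilon=1.0))
```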
{"title":"On the impact of multi-dimensional local differential privacy on fairness","authors":"Karima Makhlouf, Héber H. Arcolezi, Sami Zhioua, Ghassen Ben Brahim, Catuscia Palamidessi","doi":"10.1007/s10618-024-01031-0","DOIUrl":"https://doi.org/10.1007/s10618-024-01031-0","url":null,"abstract":"<p>Automated decision systems are increasingly used to make consequential decisions in people’s lives. Due to the sensitivity of the manipulated data and the resulting decisions, several ethical concerns need to be addressed for the appropriate use of such technologies, particularly fairness and privacy. Unlike previous work, which focused on centralized differential privacy (DP) or on local DP (LDP) for a single sensitive attribute, in this paper, we examine the impact of LDP in the presence of several sensitive attributes (i.e., <i>multi-dimensional data</i>) on fairness. Detailed empirical analysis on synthetic and benchmark datasets revealed very relevant observations. In particular, (1) multi-dimensional LDP is an efficient approach to reduce disparity, (2) the variant of the multi-dimensional approach of LDP (we employ two variants) matters only at low privacy guarantees (high <span>(epsilon)</span>), and (3) the true decision distribution has an important effect on which group is more sensitive to the obfuscation. Last, we summarize our findings in the form of recommendations to guide practitioners in adopting effective privacy-preserving practices while maintaining fairness and utility in machine learning applications.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"37 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-27 DOI: 10.1007/s10618-024-01030-1
Effective interpretable learning for large-scale categorical data
Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li
Large-scale categorical datasets are ubiquitous in machine learning, and the success of most deployed machine learning models relies on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (aka. wide learning). The second strategy focuses on the depth of a network, e.g., Artificial Neural Networks or ANN (aka. deep learning). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (aka. factorized learning). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, combining the various strategies leads to excellent results; for example, WD-Learning, xdeepFM, etc., lead to state-of-the-art results. Following this trend, we propose another learning framework, WBDF-Learning, based on the combination of wide, deep, and factorized learning and a newly introduced component named Broad Interaction Network (BIN). BIN takes the form of a Bayesian network classifier whose structure is learned a priori and whose parameters are learned by optimizing a joint objective function along with the wide, deep, and factorized parts. We denote the learning of BIN parameters as broad learning. Additionally, the parameters of BIN are constrained to be actual probabilities, which makes the model highly interpretable. Furthermore, one can sample or generate data from BIN, which can facilitate learning and provides a framework for knowledge-guided machine learning. We demonstrate that our proposed framework is resilient enough to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that the WBDF framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability, and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.
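A structural sketch of how the four components might be combined into one logit, not the paper's implementation: the BIN part is stood in for by a naive-Bayes-style per-feature log-odds (the real BIN is a Bayesian network learned a priori), and all shapes and parameters are illustrative.

```python
import numpy as np

def wide(x, w, b):      # generalized linear (wide) part
    return x @ w + b

def fm(x, V):           # factorized pairwise interactions (FM part)
    return 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))

def deep(x, W1, W2):    # a tiny one-hidden-layer MLP stands in for the deep part
    return np.maximum(x @ W1, 0.0) @ W2

def bin_logodds(x, log_odds):
    # Stand-in for BIN: per-feature log-odds, naive-Bayes style.
    return float(x @ log_odds)

def wbdf_logit(x, params):
    return (wide(x, *params["wide"]) + fm(x, params["fm"])
            + deep(x, *params["deep"]) + bin_logodds(x, params["bin"]))

rng = np.random.default_rng(0)
d, h, k = 10, 8, 4
x = rng.integers(0, 2, size=d).astype(float)  # one-hot-encoded categorical input
params = {"wide": (rng.standard_normal(d), 0.0),
          "fm": rng.standard_normal((d, k)),
          "deep": (rng.standard_normal((d, h)), rng.standard_normal(h)),
          "bin": rng.standard_normal(d)}
print(wbdf_logit(x, params))
```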
{"title":"Effective interpretable learning for large-scale categorical data","authors":"Yishuo Zhang, Nayyar Zaidi, Jiahui Zhou, Tao Wang, Gang Li","doi":"10.1007/s10618-024-01030-1","DOIUrl":"https://doi.org/10.1007/s10618-024-01030-1","url":null,"abstract":"<p>Large scale categorical datasets are ubiquitous in machine learning and the success of most deployed machine learning models rely on how effectively the features are engineered. For large-scale datasets, parametric methods are generally used, among which three strategies for feature engineering are quite common. The first strategy focuses on managing the breadth (or width) of a network, e.g., generalized linear models (aka. <span>wide learning</span>). The second strategy focuses on the depth of a network, e.g., Artificial Neural networks or <span>ANN</span> (aka. <span>deep learning</span>). The third strategy relies on factorizing the interaction terms, e.g., Factorization Machines (aka. <span>factorized learning</span>). Each of these strategies brings its own advantages and disadvantages. Recently, it has been shown that for categorical data, combination of various strategies leads to excellent results. For example, <span>WD</span>-Learning, <span>xdeepFM</span>, etc., leads to state-of-the-art results. Following the trend, in this work, we have proposed another learning framework—<span>WBDF</span>-Learning, based on the combination of <span>wide</span>, <span>deep</span>, <span>factorization</span>, and a newly introduced component named <span>Broad Interaction network</span> (<span>BIN</span>). <span>BIN</span> is in the form of a Bayesian network classifier whose structure is learned apriori, and parameters are learned by optimizing a joint objective function along with <span>wide</span>, <span>deep</span> and <span>factorized</span> parts. We denote the learning of <span>BIN</span> parameters as <span>broad learning</span>. Additionally, the parameters of <span>BIN</span> are constrained to be actual probabilities—therefore, it is extremely interpretable. Furthermore, one can sample or generate data from <span>BIN</span>, which can facilitate learning and provides a framework for <i>knowledge-guided machine learning</i>. We demonstrate that our proposed framework possesses the resilience to maintain excellent classification performance when confronted with biased datasets. We evaluate the efficacy of our framework in terms of classification performance on various benchmark large-scale categorical datasets and compare against state-of-the-art methods. It is shown that, <span>WBDF</span> framework (a) exhibits superior performance on classification tasks, (b) boasts outstanding interpretability and (c) demonstrates exceptional resilience and effectiveness in scenarios involving skewed distributions.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"22 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-26 DOI: 10.1007/s10618-024-01037-8
WaveLSea: helping experts interactively explore pattern mining search spaces
Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali
This article presents the method Wave Top-k Random-d Lineage Search (WaveLSea), which guides an expert through data mining results according to her interest. The method exploits expert feedback, combined with the relations between patterns, to spread the expert’s interest. It avoids the typical feature definition step commonly used in interactive data mining, which limits the flexibility of the discovery process. We empirically demonstrate that WaveLSea returns the most relevant results for the user’s subjective interest. Even with imperfect feedback, WaveLSea remains robust, still delivering mostly interesting results in experiments on graph-structured data. To assess the robustness of the method, we design novel oracles, called soothsayers, that give imperfect feedback. Finally, we complement our quantitative study with a qualitative study that evaluates WaveLSea through a user interface.
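A toy sketch of the feedback-spreading intuition, not the WaveLSea algorithm itself: patterns form a graph whose edges encode a relation such as lineage, and a rating on one pattern nudges the interest of nearby patterns with per-hop decay. The graph, scores, and decay are illustrative assumptions.

```python
from collections import defaultdict

def spread_feedback(edges, rated_pattern, delta, decay=0.5, depth=2):
    """Return interest updates: `delta` at the rated pattern, attenuated by
    `decay` per hop along pattern relations."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    updates, frontier = {rated_pattern: delta}, {rated_pattern}
    for _ in range(depth):
        delta *= decay
        frontier = {w for u in frontier for w in neighbors[u]} - updates.keys()
        updates.update({w: delta for w in frontier})
    return updates

edges = [("A", "AB"), ("AB", "ABC"), ("A", "AC")]  # a tiny pattern lineage
print(spread_feedback(edges, "AB", delta=1.0))     # expert liked pattern AB
```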
{"title":"WaveLSea: helping experts interactively explore pattern mining search spaces","authors":"Etienne Lehembre, Bruno Cremilleux, Albrecht Zimmermann, Bertrand Cuissart, Abdelkader Ouali","doi":"10.1007/s10618-024-01037-8","DOIUrl":"https://doi.org/10.1007/s10618-024-01037-8","url":null,"abstract":"<p>This article presents the method Wave Top-k Random-d Lineage Search (WaveLSea) which guides an expert through data mining results according to her interest. The method exploits expert feedback, combined with the relation between patterns to spread the expert’s interest. It avoids the typical feature definition step commonly used in interactive data mining which limits the flexibility of the discovery process. We empirically demonstrate that WaveLSea returns the most relevant results for the user’s subjective interest. Even with imperfect feedback, WaveLSea behavior remains robust as it primarily still delivers most interesting results during experiments on graph-structured data. In order to assess the robustness of the method we design novel oracles called soothsayers giving imperfect feedback. Finally, we complete our quantitative study with a qualitative study using a user interface to evaluate WaveLSea.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"98 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-25 DOI: 10.1007/s10618-024-01026-x
Active learning with biased non-response to label requests
Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy
Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy, the Upper Confidence Bound of the Expected Utility (UCB-EU), which can plausibly be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data-generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements for conversion models trained on clicked impressions. Most generally, this research serves both to better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.
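A sketch of the UCB-EU scoring idea under stated assumptions: a candidate's informativeness (expected utility) is discounted by an optimistic upper confidence bound on the probability that its label request will be answered, so rarely queried segments are still explored. The utility values, response statistics, and exploration constant are placeholders.

```python
import math

def ucb_response_rate(answered, requested, total_requests, c=1.0):
    """Optimistic estimate of a segment's label-response rate."""
    if requested == 0:
        return 1.0  # optimistic prior so unexplored segments get tried
    mean = answered / requested
    return min(1.0, mean + c * math.sqrt(math.log(total_requests) / requested))

def ucb_eu(expected_utility, answered, requested, total_requests):
    # Informativeness discounted by the (optimistic) chance of a response.
    return expected_utility * ucb_response_rate(answered, requested, total_requests)

# Candidates as (expected_utility, answered, requested) per segment.
candidates = [(0.9, 1, 10), (0.6, 8, 10), (0.7, 0, 0)]
total = sum(r for _, _, r in candidates)
best = max(candidates, key=lambda c: ucb_eu(c[0], c[1], c[2], total))
print(best)
```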
{"title":"Active learning with biased non-response to label requests","authors":"Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy","doi":"10.1007/s10618-024-01026-x","DOIUrl":"https://doi.org/10.1007/s10618-024-01026-x","url":null,"abstract":"<p>Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can impact active learning’s effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy–the <i>Upper Confidence Bound of the Expected Utility (UCB-EU)</i>–that can, plausibly, be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. Most generally, this research serves to both better conceptualise the interplay between types of non-response and model improvements via active learning, and to provide a practical, easy-to-implement correction that mitigates model degradation.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"36 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-05-22 DOI: 10.1007/s10618-024-01036-9
quant: a minimalist interval method for time series classification
Angus Dempster, Daniel F. Schmidt, Geoffrey I. Webb
We show that it is possible to achieve the same accuracy, on average, as the most accurate existing interval methods for time series classification on a standard set of benchmark datasets using a single type of feature (quantiles), fixed intervals, and an ‘off-the-shelf’ classifier. This distillation of interval-based approaches represents a fast and accurate method for time series classification, achieving state-of-the-art accuracy on the expanded set of 142 datasets in the UCR archive with a total compute time (training and inference) of less than 15 minutes using a single CPU core.
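A minimal sketch of the recipe: fixed quantiles over fixed dyadic intervals of each series, fed to an off-the-shelf classifier. The interval depths, quantile count, and choice of classifier here are illustrative, not the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def quantile_features(X, depths=(0, 1, 2), n_quantiles=4):
    """X: (n_series, length). At each depth d, split every series into 2^d
    equal intervals and take evenly spaced quantiles of each interval."""
    qs = np.linspace(0.0, 1.0, n_quantiles)
    feats = []
    for d in depths:
        for part in np.array_split(X, 2 ** d, axis=1):
            feats.append(np.quantile(part, qs, axis=1).T)  # (n_series, q)
    return np.hstack(feats)

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 64)), rng.integers(0, 2, 100)
clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(quantile_features(X), y)  # the classifier itself is entirely standard
```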
{"title":"quant: a minimalist interval method for time series classification","authors":"Angus Dempster, Daniel F. Schmidt, Geoffrey I. Webb","doi":"10.1007/s10618-024-01036-9","DOIUrl":"https://doi.org/10.1007/s10618-024-01036-9","url":null,"abstract":"<p>We show that it is possible to achieve the same accuracy, on average, as the most accurate existing interval methods for time series classification on a standard set of benchmark datasets using a single type of feature (quantiles), fixed intervals, and an ‘off the shelf’ classifier. This distillation of interval-based approaches represents a fast and accurate method for time series classification, achieving state-of-the-art accuracy on the expanded set of 142 datasets in the UCR archive with a total compute time (training and inference) of less than 15 min using a single CPU core.</p>","PeriodicalId":55183,"journal":{"name":"Data Mining and Knowledge Discovery","volume":"50 1","pages":""},"PeriodicalIF":4.8,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141150085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}