Pub Date : 2024-04-15DOI: 10.1007/s10994-024-06544-9
Tomás Gutierrez, Davi Valladão, Bernardo K. Pagnoncelli
PolieDRO is a novel analytics framework for classification and regression that harnesses the power and flexibility of data-driven distributionally robust optimization (DRO) to circumvent the need for regularization hyperparameters. Recent literature shows that traditional machine learning methods such as SVM and (square-root) LASSO can be written as Wasserstein-based DRO problems. Inspired by those results we propose a hyperparameter-free ambiguity set that explores the polyhedral structure of data-driven convex hulls, generating computationally tractable regression and classification methods for any convex loss function. Numerical results based on 100 real-world databases and an extensive experiment with synthetically generated data show that our methods consistently outperform their traditional counterparts.
{"title":"PolieDRO: a novel classification and regression framework with non-parametric data-driven regularization","authors":"Tomás Gutierrez, Davi Valladão, Bernardo K. Pagnoncelli","doi":"10.1007/s10994-024-06544-9","DOIUrl":"https://doi.org/10.1007/s10994-024-06544-9","url":null,"abstract":"<p>PolieDRO is a novel analytics framework for classification and regression that harnesses the power and flexibility of data-driven distributionally robust optimization (DRO) to circumvent the need for regularization hyperparameters. Recent literature shows that traditional machine learning methods such as SVM and (square-root) LASSO can be written as Wasserstein-based DRO problems. Inspired by those results we propose a hyperparameter-free ambiguity set that explores the polyhedral structure of data-driven convex hulls, generating computationally tractable regression and classification methods for any convex loss function. Numerical results based on 100 real-world databases and an extensive experiment with synthetically generated data show that our methods consistently outperform their traditional counterparts.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"14 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140611189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-10DOI: 10.1007/s10994-024-06526-x
Susobhan Ghosh, Raphael Kim, Prasidh Chhabria, Raaz Dwivedi, Predrag Klasnja, Peng Liao, Kelly Zhang, Susan Murphy
There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users as well as within specific users in the study.
{"title":"Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling","authors":"Susobhan Ghosh, Raphael Kim, Prasidh Chhabria, Raaz Dwivedi, Predrag Klasnja, Peng Liao, Kelly Zhang, Susan Murphy","doi":"10.1007/s10994-024-06526-x","DOIUrl":"https://doi.org/10.1007/s10994-024-06526-x","url":null,"abstract":"<p>There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users as well as within specific users in the study.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"57 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-10DOI: 10.1007/s10994-024-06527-w
Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo
Social media platforms enable the rapid dissemination and consumption of information. However, users instantly consume such content regardless of the reliability of the shared data. Consequently, the latter crowdsourcing model is exposed to manipulation. This work contributes with an explainable and online classification method to recognize fake news in real-time. The proposed method combines both unsupervised and supervised Machine Learning approaches with online created lexica. The profiling is built using creator-, content- and context-based features using Natural Language Processing techniques. The explainable classification mechanism displays in a dashboard the features selected for classification and the prediction confidence. The performance of the proposed solution has been validated with real data sets from Twitter and the results attain 80% accuracy and macro F-measure. This proposal is the first to jointly provide data stream processing, profiling, classification and explainability. Ultimately, the proposed early detection, isolation and explanation of fake news contribute to increase the quality and trustworthiness of social media contents.
{"title":"Exposing and explaining fake news on-the-fly","authors":"Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo","doi":"10.1007/s10994-024-06527-w","DOIUrl":"https://doi.org/10.1007/s10994-024-06527-w","url":null,"abstract":"<p>Social media platforms enable the rapid dissemination and consumption of information. However, users instantly consume such content regardless of the reliability of the shared data. Consequently, the latter crowdsourcing model is exposed to manipulation. This work contributes with an explainable and online classification method to recognize fake news in real-time. The proposed method combines both unsupervised and supervised Machine Learning approaches with online created lexica. The profiling is built using creator-, content- and context-based features using Natural Language Processing techniques. The explainable classification mechanism displays in a dashboard the features selected for classification and the prediction confidence. The performance of the proposed solution has been validated with real data sets from Twitter and the results attain 80% accuracy and macro <i>F</i>-measure. This proposal is the first to jointly provide data stream processing, profiling, classification and explainability. Ultimately, the proposed early detection, isolation and explanation of fake news contribute to increase the quality and trustworthiness of social media contents.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"36 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-08DOI: 10.1007/s10994-024-06519-w
Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, Morteza Haghir Chehreghani
Deep learning-based approaches for generating novel drug molecules with specific properties have gained a lot of interest in the last few years. Recent studies have demonstrated promising performance for string-based generation of novel molecules utilizing reinforcement learning. In this paper, we develop a unified framework for using reinforcement learning for de novo drug design, wherein we systematically study various on- and off-policy reinforcement learning algorithms and replay buffers to learn an RNN-based policy to generate novel molecules predicted to be active against the dopamine receptor DRD2. Our findings suggest that it is advantageous to use at least both top-scoring and low-scoring molecules for updating the policy when structural diversity is essential. Using all generated molecules at an iteration seems to enhance performance stability for on-policy algorithms. In addition, when replaying high, intermediate, and low-scoring molecules, off-policy algorithms display the potential of improving the structural diversity and number of active molecules generated, but possibly at the cost of a longer exploration phase. Our work provides an open-source framework enabling researchers to investigate various reinforcement learning methods for de novo drug design.
{"title":"Utilizing reinforcement learning for de novo drug design","authors":"Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, Morteza Haghir Chehreghani","doi":"10.1007/s10994-024-06519-w","DOIUrl":"https://doi.org/10.1007/s10994-024-06519-w","url":null,"abstract":"<p>Deep learning-based approaches for generating novel drug molecules with specific properties have gained a lot of interest in the last few years. Recent studies have demonstrated promising performance for string-based generation of novel molecules utilizing reinforcement learning. In this paper, we develop a unified framework for using reinforcement learning for de novo drug design, wherein we systematically study various on- and off-policy reinforcement learning algorithms and replay buffers to learn an RNN-based policy to generate novel molecules predicted to be active against the dopamine receptor DRD2. Our findings suggest that it is advantageous to use at least both top-scoring and low-scoring molecules for updating the policy when structural diversity is essential. Using all generated molecules at an iteration seems to enhance performance stability for on-policy algorithms. In addition, when replaying high, intermediate, and low-scoring molecules, off-policy algorithms display the potential of improving the structural diversity and number of active molecules generated, but possibly at the cost of a longer exploration phase. Our work provides an open-source framework enabling researchers to investigate various reinforcement learning methods for de novo drug design.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"43 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-08DOI: 10.1007/s10994-024-06537-8
Kejie Tang, Weidong Liu, Xiaojun Mao
Decentralized distributed learning has recently attracted significant attention in many applications in machine learning and signal processing. To solve a decentralized optimization with regularization, we propose a Multi-consensus Decentralized Primal-Dual Fixed Point (MD-PDFP) algorithm. We apply multiple consensus steps with the gradient tracking technique to extend the primal-dual fixed point method over a network. The communication complexities of our procedure are given under certain conditions. Moreover, we show that our algorithm is consistent under general conditions and enjoys global linear convergence under strong convexity. With some particular choices of regularizations, our algorithm can be applied to decentralized machine learning applications. Finally, several numerical experiments and real data analyses are conducted to demonstrate the effectiveness of the proposed algorithm.
{"title":"Multi-consensus decentralized primal-dual fixed point algorithm for distributed learning","authors":"Kejie Tang, Weidong Liu, Xiaojun Mao","doi":"10.1007/s10994-024-06537-8","DOIUrl":"https://doi.org/10.1007/s10994-024-06537-8","url":null,"abstract":"<p>Decentralized distributed learning has recently attracted significant attention in many applications in machine learning and signal processing. To solve a decentralized optimization with regularization, we propose a Multi-consensus Decentralized Primal-Dual Fixed Point (MD-PDFP) algorithm. We apply multiple consensus steps with the gradient tracking technique to extend the primal-dual fixed point method over a network. The communication complexities of our procedure are given under certain conditions. Moreover, we show that our algorithm is consistent under general conditions and enjoys global linear convergence under strong convexity. With some particular choices of regularizations, our algorithm can be applied to decentralized machine learning applications. Finally, several numerical experiments and real data analyses are conducted to demonstrate the effectiveness of the proposed algorithm.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"43 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-08DOI: 10.1007/s10994-024-06538-7
Andreas Bueff, Vaishak Belle
Deep neural networks, despite their capabilities, are constrained by the need for large-scale training data, and often fall short in generalisation and interpretability. Inductive logic programming (ILP) presents an intriguing solution with its data-efficient learning of first-order logic rules. However, ILP grapples with challenges, notably the handling of non-linearity in continuous domains. With the ascent of neuro-symbolic ILP, there’s a drive to mitigate these challenges, synergising deep learning with relational ILP models to enhance interpretability and create logical decision boundaries. In this research, we introduce a neuro-symbolic ILP framework, grounded on differentiable Neural Logic networks, tailored for non-linear rule extraction in mixed discrete-continuous spaces. Our methodology consists of a neuro-symbolic approach, emphasising the extraction of non-linear functions from mixed domain data. Our preliminary findings showcase our architecture’s capability to identify non-linear functions from continuous data, offering a new perspective in neural-symbolic research and underlining the adaptability of ILP-based frameworks for regression challenges in continuous scenarios.
{"title":"Learning explanatory logical rules in non-linear domains: a neuro-symbolic approach","authors":"Andreas Bueff, Vaishak Belle","doi":"10.1007/s10994-024-06538-7","DOIUrl":"https://doi.org/10.1007/s10994-024-06538-7","url":null,"abstract":"<p>Deep neural networks, despite their capabilities, are constrained by the need for large-scale training data, and often fall short in generalisation and interpretability. Inductive logic programming (ILP) presents an intriguing solution with its data-efficient learning of first-order logic rules. However, ILP grapples with challenges, notably the handling of non-linearity in continuous domains. With the ascent of neuro-symbolic ILP, there’s a drive to mitigate these challenges, synergising deep learning with relational ILP models to enhance interpretability and create logical decision boundaries. In this research, we introduce a neuro-symbolic ILP framework, grounded on differentiable Neural Logic networks, tailored for non-linear rule extraction in mixed discrete-continuous spaces. Our methodology consists of a neuro-symbolic approach, emphasising the extraction of non-linear functions from mixed domain data. Our preliminary findings showcase our architecture’s capability to identify non-linear functions from continuous data, offering a new perspective in neural-symbolic research and underlining the adaptability of ILP-based frameworks for regression challenges in continuous scenarios.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"29 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-04-03DOI: 10.1007/s10994-024-06536-9
Rui-Ray Zhang, Massih-Reza Amini
Traditional statistical learning theory relies on the assumption that data are identically and independently distributed (i.i.d.). However, this assumption often does not hold in many real-life applications. In this survey, we explore learning scenarios where examples are dependent and their dependence relationship is described by a dependency graph, a commonly utilized model in probability and combinatorics. We collect various graph-dependent concentration bounds, which are then used to derive Rademacher complexity and stability generalization bounds for learning from graph-dependent data. We illustrate this paradigm through practical learning tasks and provide some research directions for future work. To our knowledge, this survey is the first of this kind on this subject.
{"title":"Generalization bounds for learning under graph-dependence: a survey","authors":"Rui-Ray Zhang, Massih-Reza Amini","doi":"10.1007/s10994-024-06536-9","DOIUrl":"https://doi.org/10.1007/s10994-024-06536-9","url":null,"abstract":"<p>Traditional statistical learning theory relies on the assumption that data are identically and independently distributed (i.i.d.). However, this assumption often does not hold in many real-life applications. In this survey, we explore learning scenarios where examples are dependent and their dependence relationship is described by a <i>dependency graph</i>, a commonly utilized model in probability and combinatorics. We collect various graph-dependent concentration bounds, which are then used to derive Rademacher complexity and stability generalization bounds for learning from graph-dependent data. We illustrate this paradigm through practical learning tasks and provide some research directions for future work. To our knowledge, this survey is the first of this kind on this subject.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"13 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A combination of tools and methods known as "graph mining" is used to evaluate real-world graphs, forecast the potential effects of a given graph’s structure and properties for various applications, and build models that can yield actual graphs that closely resemble the structure seen in real-world graphs of interest. However, some graph mining approaches possess scalability and dynamic graph challenges, limiting practical applications. In machine learning and data mining, among the unique methods is graph embedding, known as network representation learning where representative methods suggest encoding the complicated graph structures into embedding by utilizing specific pre-defined metrics. Co-occurrence graphs and keyword searches are the foundation of search engine optimizations for diverse real-world applications. Current work on keyword searches on graphs is based on pre-established information retrieval search criteria and does not provide semantic linkages. Recent works on co-occurrence and keyword search methods function effectively on graphs with only one layer instead of many layers. However, the graph neural network has been utilized in recent years as a branch of graph model due to its excellent performance. This paper proposes an Effective Keyword Search Co-occurrence Multi-Layer Graph mining method by employing two core approaches: Multi-layer Graph Embedding and Graph Neural Networks. We conducted extensive tests using benchmarks on real-world data sets. Considering the experimental findings, the proposed method enhanced with the regularization approach is substantially excellent, with a 10% increment in precision, recall, and f1-score.
被称为 "图挖掘 "的工具和方法组合可用于评估现实世界中的图,预测给定图的结构和属性对各种应用的潜在影响,并建立模型,以生成与现实世界中相关图的结构非常相似的实际图。然而,一些图挖掘方法面临着可扩展性和动态图的挑战,限制了实际应用。在机器学习和数据挖掘领域,图嵌入是一种独特的方法,被称为网络表示学习,其中具有代表性的方法建议利用特定的预定义指标将复杂的图结构编码为嵌入。共现图和关键词搜索是搜索引擎优化的基础,适用于各种实际应用。目前在图上进行关键词搜索的工作是基于预先确定的信息检索搜索标准,并不提供语义链接。最近关于共现和关键词搜索方法的研究成果能在只有一层而非多层的图上有效发挥作用。然而,图神经网络作为图模型的一个分支,因其出色的性能近年来得到了广泛应用。本文通过采用两种核心方法,提出了一种有效的关键词搜索共现多层图挖掘方法:多层图嵌入和图神经网络。我们利用真实世界数据集上的基准进行了大量测试。从实验结果来看,使用正则化方法增强的拟议方法非常出色,精确度、召回率和 f1 分数均提高了 10%。
{"title":"An effective keyword search co-occurrence multi-layer graph mining approach","authors":"Janet Oluwasola Bolorunduro, Zhaonian Zou, Mohamed Jaward Bah","doi":"10.1007/s10994-024-06528-9","DOIUrl":"https://doi.org/10.1007/s10994-024-06528-9","url":null,"abstract":"<p>A combination of tools and methods known as \"graph mining\" is used to evaluate real-world graphs, forecast the potential effects of a given graph’s structure and properties for various applications, and build models that can yield actual graphs that closely resemble the structure seen in real-world graphs of interest. However, some graph mining approaches possess scalability and dynamic graph challenges, limiting practical applications. In machine learning and data mining, among the unique methods is graph embedding, known as network representation learning where representative methods suggest encoding the complicated graph structures into embedding by utilizing specific pre-defined metrics. Co-occurrence graphs and keyword searches are the foundation of search engine optimizations for diverse real-world applications. Current work on keyword searches on graphs is based on pre-established information retrieval search criteria and does not provide semantic linkages. Recent works on co-occurrence and keyword search methods function effectively on graphs with only one layer instead of many layers. However, the graph neural network has been utilized in recent years as a branch of graph model due to its excellent performance. This paper proposes an Effective Keyword Search Co-occurrence Multi-Layer Graph mining method by employing two core approaches: Multi-layer Graph Embedding and Graph Neural Networks. We conducted extensive tests using benchmarks on real-world data sets. Considering the experimental findings, the proposed method enhanced with the regularization approach is substantially excellent, with a 10% increment in precision, recall, and f1-score.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"32 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140596326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-29DOI: 10.1007/s10994-023-06495-7
Zayd Hammoudeh, Daniel Lowd
Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training’s underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data’s influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound.
{"title":"Training data influence analysis and estimation: a survey","authors":"Zayd Hammoudeh, Daniel Lowd","doi":"10.1007/s10994-023-06495-7","DOIUrl":"https://doi.org/10.1007/s10994-023-06495-7","url":null,"abstract":"<p>Good models require good training data. For overparameterized deep models, the causal relationship between training data and model predictions is increasingly opaque and poorly understood. Influence analysis partially demystifies training’s underlying interactions by quantifying the amount each training instance alters the final model. Measuring the training data’s influence exactly can be provably hard in the worst case; this has led to the development and use of influence estimators, which only approximate the true influence. This paper provides the first comprehensive survey of training data influence analysis and estimation. We begin by formalizing the various, and in places orthogonal, definitions of training data influence. We then organize state-of-the-art influence analysis methods into a taxonomy; we describe each of these methods in detail and compare their underlying assumptions, asymptotic complexities, and overall strengths and weaknesses. Finally, we propose future research directions to make influence analysis more useful in practice as well as more theoretically and empirically sound.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"43 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-03-29DOI: 10.1007/s10994-024-06534-x
Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, Jesse Davis
Machine learning models always make a prediction, even when it is likely to be inaccurate. This behavior should be avoided in many decision support applications, where mistakes can have severe consequences. Albeit already studied in 1970, machine learning with rejection recently gained interest. This machine learning subfield enables machine learning models to abstain from making a prediction when likely to make a mistake. This survey aims to provide an overview on machine learning with rejection. We introduce the conditions leading to two types of rejection, ambiguity and novelty rejection, which we carefully formalize. Moreover, we review and categorize strategies to evaluate a model’s predictive and rejective quality. Additionally, we define the existing architectures for models with rejection and describe the standard techniques for learning such models. Finally, we provide examples of relevant application domains and show how machine learning with rejection relates to other machine learning research areas.
{"title":"Machine learning with a reject option: a survey","authors":"Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, Jesse Davis","doi":"10.1007/s10994-024-06534-x","DOIUrl":"https://doi.org/10.1007/s10994-024-06534-x","url":null,"abstract":"<p>Machine learning models always make a prediction, even when it is likely to be inaccurate. This behavior should be avoided in many decision support applications, where mistakes can have severe consequences. Albeit already studied in 1970, machine learning with rejection recently gained interest. This machine learning subfield enables machine learning models to abstain from making a prediction when likely to make a mistake. This survey aims to provide an overview on machine learning with rejection. We introduce the conditions leading to two types of rejection, ambiguity and novelty rejection, which we carefully formalize. Moreover, we review and categorize strategies to evaluate a model’s predictive and rejective quality. Additionally, we define the existing architectures for models with rejection and describe the standard techniques for learning such models. Finally, we provide examples of relevant application domains and show how machine learning with rejection relates to other machine learning research areas.</p>","PeriodicalId":49900,"journal":{"name":"Machine Learning","volume":"17 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140884416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}