Joint empirical risk minimization for instance-dependent positive-unlabeled data

Knowledge-Based Systems (IF 7.6; Zone 1, Computer Science; JCR Q1, Computer Science, Artificial Intelligence). Pub Date: 2024-11-25. Epub Date: 2024-09-06. DOI: 10.1016/j.knosys.2024.112444

Wojciech Rejchel , Paweł Teisseyre , Jan Mielniczuk

Knowledge-Based Systems, volume 304, Article 112444. https://www.sciencedirect.com/science/article/pii/S0950705124010785


Abstract

Learning from positive and unlabeled data (PU learning) is an actively researched machine learning task. The goal is to train a binary classification model on a training dataset that contains a labeled subset of the positive instances together with unlabeled instances. The unlabeled set includes the remaining positives and all negative observations. An important element of PU learning is modeling the labeling mechanism, i.e. the assignment of labels to positive observations. Unlike many prior works, we consider a realistic setting in which the probability of label assignment, i.e. the propensity score, is instance-dependent. In our approach we investigate the minimizer of an empirical counterpart of a joint risk that depends on both the posterior probability of belonging to the positive class and the propensity score. The non-convex empirical risk is optimized alternately with respect to the parameters of the two functions. In the theoretical analysis we establish risk consistency of the minimizers using recently derived methods from the theory of empirical processes. In addition, an important contribution is a novel implementation of the optimization algorithm, for which sequential approximation of the set of positive observations among the unlabeled ones is crucial. This relies on a modified 'spies' technique as well as on a thresholding rule based on conditional probabilities. Experiments conducted on 20 data sets for various labeling scenarios show that the proposed method works on par with or more effectively than state-of-the-art methods based on propensity function estimation.
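
The alternating optimization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that both the posterior probability and the propensity score are logistic models, that the observed label indicator s satisfies P(s=1|x) = y(x) · e(x), and that the joint risk is the corresponding log-loss. The 'spies' step and the thresholding rule are not shown. All function names, the step size, and the synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_risk(beta, gamma, X, s, eps=1e-12):
    """Log-loss of label indicators s under the assumed model
    P(s=1|x) = sigmoid(x @ beta) * sigmoid(x @ gamma)."""
    p = np.clip(sigmoid(X @ beta) * sigmoid(X @ gamma), eps, 1 - eps)
    return -np.mean(s * np.log(p) + (1 - s) * np.log(1 - p))

def grad_wrt(theta, other, X, s, eps=1e-12):
    """Gradient of the joint risk w.r.t. `theta`, with `other` held fixed.
    The product form is symmetric, so one function serves both blocks."""
    a = sigmoid(X @ theta)
    b = sigmoid(X @ other)
    p = np.clip(a * b, eps, 1 - eps)
    dloss_dp = -(s / p - (1 - s) / (1 - p))   # derivative of log-loss in p
    dp_dz = b * a * (1 - a)                   # derivative of p in x @ theta
    return X.T @ (dloss_dp * dp_dz) / len(s)

def fit_alternating(X, s, n_outer=30, n_inner=10, lr=0.2):
    """Alternate plain gradient steps on beta (posterior model) and
    gamma (propensity model). Hypothetical settings, chosen for illustration."""
    d = X.shape[1]
    beta, gamma = np.zeros(d), np.zeros(d)
    history = [joint_risk(beta, gamma, X, s)]
    for _ in range(n_outer):
        for _ in range(n_inner):
            beta -= lr * grad_wrt(beta, gamma, X, s)
        for _ in range(n_inner):
            gamma -= lr * grad_wrt(gamma, beta, X, s)
        history.append(joint_risk(beta, gamma, X, s))
    return beta, gamma, history

# Synthetic instance-dependent PU data (illustrative only).
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.random(n) < sigmoid(X @ np.array([0.5, 2.0, -1.0]))        # true class
labeled = rng.random(n) < sigmoid(X @ np.array([0.0, 1.0, 1.0]))   # propensity
s = (y & labeled).astype(float)   # only positives can receive a label

beta, gamma, history = fit_alternating(X, s)
print(history[0], history[-1])
```

Because the risk is non-convex and the product y(x) · e(x) is symmetric in its two factors, the result of such a scheme depends on initialization. This sketch makes no attempt to resolve which factor plays which role.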




Source journal: Knowledge-Based Systems (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months

Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.