A novel communication-efficient heterogeneous federated positive and unlabeled learning method for credit scoring

IF 4.3; Zone 2 (Engineering & Technology); JCR Q2, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Computers & Operations Research; Pub Date: 2025-05-01; Epub Date: 2025-01-18; DOI: 10.1016/j.cor.2025.106982

Yongqin Qiu , Yuanxing Chen , Kan Fang , Kuangnan Fang

Computers & Operations Research, Volume 177, Article 106982. https://www.sciencedirect.com/science/article/pii/S0305054825000103

Citations: 0

Abstract

In emerging financial institutions, customer records commonly include only customers in default (positive samples) and rejected customers (unlabeled samples), that is, positive and unlabeled (PU) data. Building credit scoring models from multiple small-sample, high-dimensional PU datasets poses significant challenges, especially given the privacy constraints on transferring raw data. To tackle these challenges, this paper introduces a novel methodology called heterogeneous federated PU learning. The approach uses a fused penalty function to automatically divide coefficients into multiple clusters, and trains the model with an efficient proximal gradient descent algorithm that relies solely on gradients from local servers. Theoretical analysis establishes the oracle property of our proposed estimator. Simulation results show that, in terms of variable selection, parameter estimation, and prediction performance, our method is close to the oracle estimator and outperforms the alternatives. Empirical results indicate that our method can improve prediction performance and facilitate the identification of heterogeneity across datasets. Moreover, the estimated clustering structures reveal that geographically closer provinces exhibit greater similarity in credit risk. This implies that the proposed methodology can help nascent financial institutions identify differences in risk factors across datasets and enhance predictive accuracy.
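
The training pattern described in the abstract (local servers share only gradients, and a proximal gradient step applies the penalty) can be illustrated with a rough sketch. This is not the authors' algorithm: the abstract does not give the PU loss or the form of the fused penalty, so the sketch assumes an ordinary logistic loss on fully labeled data and swaps in a plain L1 penalty, whose proximal operator is soft-thresholding. All function names, the step size, and the penalty weight `lam` are hypothetical choices for illustration.

```python
import numpy as np


def local_gradient(beta, X, y):
    """Gradient of the mean logistic loss on one client's data.

    Runs on the local server; only the returned gradient is shared.
    Assumption: ordinary binary labels, not the paper's PU loss.
    """
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (prob - y) / len(y)


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1.

    Assumption: an L1 stand-in for the paper's fused penalty.
    """
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def federated_prox_gd(clients, p, step=0.5, lam=0.05, n_iter=200):
    """Proximal gradient descent with one coefficient vector per client.

    clients: list of (X, y) pairs held by the local servers.
    Returns an array of shape (number of clients, p).
    """
    B = np.zeros((len(clients), p))
    for _ in range(n_iter):
        # Each client evaluates its gradient at its own coefficients.
        G = np.stack(
            [local_gradient(B[m], X, y) for m, (X, y) in enumerate(clients)]
        )
        # Central step: gradient update followed by the proximal map.
        B = soft_threshold(B - step * G, step * lam)
    return B


# Synthetic demo: three clients, the third with different true coefficients.
rng = np.random.default_rng(0)
truths = [np.array([1.5, 0.0, 0.0, -1.0])] * 2 + [np.array([0.0, 2.0, 0.0, 0.0])]
clients = []
for beta_true in truths:
    X = rng.normal(size=(300, 4))
    y = (rng.random(300) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
    clients.append((X, y))

B_hat = federated_prox_gd(clients, p=4)
print(np.round(B_hat, 2))
```

In this sketch the clients are fitted independently because the L1 penalty does not couple them. A fused penalty would instead penalize differences between clients' coefficient vectors, which is what lets the method in the abstract group datasets into clusters.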

Source journal

Computers & Operations Research (Engineering & Technology: Engineering, Industrial)

CiteScore: 8.60
Self-citation rate: 8.70%
Articles published: 292
Review time: 8.5 months

Journal description: Operations research and computers meet in a large number of scientific fields, many of which are of vital current concern to our troubled society. These include, among others, ecology, transportation, safety, reliability, urban planning, economics, inventory control, investment strategy and logistics (including reverse logistics). Computers & Operations Research provides an international forum for the application of computers and operations research techniques to problems in these and related fields.