Sample selection bias in non-traditional lending: A copula-based approach for imbalanced data

IF 6.2 · Zone 2 (Economics) · Q1 ECONOMICS · Socio-economic Planning Sciences · Pub Date: 2024-08-29 · DOI: 10.1016/j.seps.2024.102045

Raffaella Calabrese, Silvia Angela Osmetti, Luca Zanin

Socio-economic Planning Sciences, Volume 95, Article 102045. Journal Article. https://www.sciencedirect.com/science/article/pii/S0038012124002441

Citations: 0

Abstract

Credit scoring models for non-traditional lending channels, such as peer-to-peer (P2P) lending platforms, are usually estimated only on the sample of accepted applicants, which may bias the estimates of the risk drivers. This issue can be addressed with a reject inference technique that includes the characteristics of rejected applicants in the model. Because accepted applicants and default records are few, credit scoring models also usually face a class imbalance problem. However, the previous literature on sample selection models for credit scoring does not address class imbalance. To fill this gap, we extend the Generalised Extreme Value (GEV) regression model for binary data to the sample selection framework. We use the quantile function of the GEV distribution as the link function in both the selection and outcome equations, and we model the dependence structure between the two equations with a copula function, chosen for its flexibility. The resulting model, called the Sample Selection Generalised Extreme Value (SSGEV) model, is implemented in the R package BivGEV. We apply it to a comprehensive dataset provided by Lending Club and show that parameter estimates obtained only from accepted P2P applicants are biased, consistent with the literature. The SSGEV model achieves higher predictive accuracy than univariate approaches or a sample selection probit model. Our proposal also provides more conservative estimates of the Value-at-Risk and the Expected Shortfall.
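To make the idea of a GEV link concrete, the following is a minimal Python sketch of the standard GEV distribution function used as an inverse link, together with its quantile function as the link. It is an illustration under stated assumptions, not the BivGEV implementation: the function names and the shape-parameter name `tau` are hypothetical, and the parameterisation used in the paper may differ.

```python
import math

def gev_inverse_link(eta, tau):
    """Map a linear predictor eta to a probability via the standard GEV cdf.

    tau is the shape parameter (hypothetical name, not taken from BivGEV).
    """
    if abs(tau) < 1e-12:
        # The tau -> 0 limit of the GEV cdf is the Gumbel cdf.
        return math.exp(-math.exp(-eta))
    base = 1.0 + tau * eta
    if base <= 0.0:
        # Outside the support: the cdf is 0 when tau > 0 and 1 when tau < 0.
        return 0.0 if tau > 0 else 1.0
    return math.exp(-base ** (-1.0 / tau))

def gev_link(p, tau):
    """GEV quantile function: inverse of gev_inverse_link for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if abs(tau) < 1e-12:
        return -math.log(-math.log(p))
    return ((-math.log(p)) ** (-tau) - 1.0) / tau

if __name__ == "__main__":
    # Compare the asymmetric GEV response with the symmetric logistic one.
    for eta in (-1.0, 0.0, 1.0):
        logistic = 1.0 / (1.0 + math.exp(-eta))
        print(eta, round(gev_inverse_link(eta, 0.25), 4), round(logistic, 4))
```

Unlike the logit or probit link, this response curve is asymmetric around a linear predictor of zero, which is the property that makes GEV-type links attractive when one class is rare.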

Source journal

Socio-economic Planning Sciences (Operations Research & Management Science)

CiteScore: 9.40
Self-citation rate: 13.10%
Articles published: 294
Review time: 58 days

Journal description: Studies directed toward the more effective utilization of existing resources, e.g. mathematical programming models of health care delivery systems with relevance to more effective program design; systems analysis of fire outbreaks and its relevance to the location of fire stations; statistical analysis of the efficiency of a developing country economy or industry. Studies relating to the interaction of various segments of society and technology, e.g. the effects of government health policies on the utilization and design of hospital facilities; the relationship between housing density and the demands on public transportation or other service facilities; patterns and implications of urban development and air or water pollution. Studies devoted to the anticipation of and response to future needs for social, health and other human services, e.g. the relationship between industrial growth and the development of educational resources in affected areas; investigation of future demands for maternal and child health resources in a developing country; design of effective recycling in an urban setting.