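The impact of oversampling with “ubSMOTE” on the performance of machine learning classifiers in prediction of catastrophic health expenditures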

Operations Research for Health Care (IF 1.5, Q3, Health Care Sciences & Services) | Pub Date: 2020-12-01 | DOI: 10.1016/j.orhc.2020.100275
Songul Cinaroglu
Operations Research for Health Care, vol. 27, Article 100275 (Journal Article). https://www.sciencedirect.com/science/article/pii/S2211692320300552
Citations: 1

The impact of oversampling with “ubSMOTE” on the performance of machine learning classifiers in prediction of catastrophic health expenditures

As a common problem in classification tasks, class imbalance degrades the performance of the classifier. Catastrophic out-of-pocket (OOP) health expenditure is a specific example of a rare event faced by very few households. The objective of the present study is to demonstrate a two-step learning approach for modeling highly unbalanced catastrophic OOP health expenditure data. The data are retrieved from the nationally representative Household Budget Survey collected in 2012 by the Turkish Statistical Institute. In total, 9987 households returned valid survey responses. The predictive models are based on eight common risk factors of catastrophic OOP health expenditure. The minority class in the training dataset is oversampled by using a synthetic minority oversampling technique (SMOTE) function, and the original and balanced oversampled training datasets are used to establish the classification models. Logistic regression (LR), random forest (RF) (100 trees), support vector machine (SVM), and neural network (NN) are determined as classifiers. The weighted percentage of households faced with catastrophic OOP health expenditure is 0.14. Balanced oversampling increases the area under the receiver operating characteristic (ROC) curve of LR, RF, SVM, and NN by 0.08%, 0.62%, 0.20%, and 0.23%, respectively. The ROC curve shows NN and RF to be the best classifiers for a balanced oversampled dataset. Identifying a classifier to model highly imbalanced catastrophic OOP health expenditure requires the two-stage procedure of (i) considering a balance between classes and (ii) comparing alternative classifiers. NN and RF are good classifiers in a prediction task with imbalanced catastrophic OOP health expenditure data.
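
The oversampling step described in the abstract can be sketched in a few lines. The study's title names the `ubSMOTE` function; the code below is not that implementation but a minimal pure-Python illustration of the SMOTE idea, in which each synthetic minority point is an interpolation between a minority sample and one of its nearest minority neighbours. The function names (`smote`, `balance_training_set`, `roc_auc`), the choice of `k=5`, and the toy data are all assumptions made for illustration. As in the abstract, only the training set is oversampled, and classifiers would then be compared by area under the ROC curve.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority-class neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balance_training_set(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have the same size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    extra = smote(minority, max(0, n_majority - len(minority)), k=k, seed=seed)
    return list(X) + extra, list(y) + [minority_label] * len(extra)


def roc_auc(y_true, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy training data (invented): 8 majority points, 3 minority points.
X_train = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.4, 0.2),
           (0.3, 0.4), (0.5, 0.1), (0.2, 0.5), (0.6, 0.3),
           (2.0, 2.0), (2.2, 2.1), (2.1, 2.4)]
y_train = [0] * 8 + [1] * 3

X_bal, y_bal = balance_training_set(X_train, y_train)
```

A classifier would be fitted once on `(X_train, y_train)` and once on `(X_bal, y_bal)`, and both fits scored with `roc_auc` on the same untouched test set, which mirrors the original-versus-balanced comparison in the abstract.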

Source journal
Operations Research for Health Care (Health Care Sciences & Services)
CiteScore: 3.90
Self-citation rate: 0.00%
Articles published: 9
Review time: 69 days