AIRA-ML: Auto Insurance Risk Assessment-Machine Learning Model using Resampling Methods

International Journal of Advanced Computer Science and Applications (IF 0.7; JCR Q3; Computer Science, Theory & Methods). Pub Date: 2023-01-01. DOI: 10.14569/ijacsa.2023.0140966

Ahmed Shawky Elbhrawy, Mohamed A. Belal, Mohamed Sameh Hassanein


Citations: 0

Abstract

Predicting underwriting risk has become a major challenge because datasets in the field are imbalanced. This work uses a real-world imbalanced dataset of 30144 cases with 12 variables, in which most cases were classified as "accepting the insurance request" and a small percentage as "refusing insurance". The work developed 55 machine learning (ML) models to predict whether or not to renew policies. To address the imbalance, the models were built by applying 11 ML algorithms to the original dataset and to versions produced by four data-level resampling techniques: random oversampling, SMOTE, random undersampling, and hybrid methods (11 ML algorithms × (4 resampling techniques + 1 unbalanced dataset) = 55 ML models). The 11 ML algorithms are logistic regression (LR), random forest (RF), artificial neural network (ANN), multilayer perceptron (MLP), support vector machine (SVM), naive Bayes (NB), decision tree (DT), XGBoost, k-nearest neighbors (KNN), stochastic gradient boosting (SGB), and AdaBoost. The 55 models were evaluated with seven classifier efficiency measures: accuracy, sensitivity, specificity, AUC, precision, F1-measure, and kappa. The CRISP-DM methodology was used to ensure that the study was conducted in a rigorous and systematic manner. RapidMiner software was used to apply the algorithms and analyze the data, which highlighted the potential of ML to improve the accuracy of risk assessment in insurance underwriting. The results showed that all ML classifiers became more effective when resampling strategies were used. Hybrid resampling methods in particular improved model performance on the imbalanced data, with the RF classifier reaching an accuracy of 0.9967 and a kappa statistic of 0.992.
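The experimental design described in the abstract can be illustrated with a small sketch. The study itself used RapidMiner, so the Python below is not the authors' implementation; it is a minimal standard-library illustration, under stated assumptions, of three ideas from the abstract: the 11 × 5 experiment grid, random over- and undersampling of a two-class dataset, and six of the seven evaluation measures computed from confusion-matrix counts (AUC is omitted because it needs ranked scores rather than counts). The function names, the label strings, and the choice to balance both classes to equal size are assumptions made for illustration only.

```python
import random
from itertools import product

# The 11 algorithms and 5 dataset variants named in the abstract.
ALGORITHMS = ["LR", "RF", "ANN", "MLP", "SVM", "NB", "DT",
              "XGBoost", "KNN", "SGB", "AdaBoost"]
DATASETS = ["original", "random oversampling", "SMOTE",
            "random undersampling", "hybrid"]
EXPERIMENTS = list(product(ALGORITHMS, DATASETS))  # 11 x 5 = 55 pairs


def _split_indices(labels, minority):
    """Return (minority indices, majority indices) for a two-class label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority, seed=0):
    """Duplicate random minority rows until both classes have equal size (assumed target)."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    deficit = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority, seed=0):
    """Drop random majority rows until both classes have equal size (assumed target)."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


def binary_metrics(tp, fp, fn, tn):
    """Compute count-based measures; assumes no row or column of the matrix is empty."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "kappa": kappa}


if __name__ == "__main__":
    # Toy data, not taken from the paper: 8 accepted and 2 refused requests.
    toy_rows = list(range(10))
    toy_labels = ["accept"] * 8 + ["refuse"] * 2
    _, balanced = random_oversample(toy_rows, toy_labels, "refuse")
    print(len(EXPERIMENTS), balanced.count("accept"), balanced.count("refuse"))
```

Kappa is worth reporting alongside accuracy on data like this because a classifier that always predicts the majority class can score high accuracy while its kappa stays near zero. SMOTE and hybrid resampling are left out of the sketch because they need a feature-space interpolation step that does not fit in a few lines of standard-library code.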


Source journal

International Journal of Advanced Computer Science and Applications (Computer Science, Theory & Methods)

CiteScore: 2.30

Self-citation rate: 22.20%

Articles published: 519

About the journal: IJACSA is a scholarly computer science journal representing the best in research. Its mission is to provide an outlet for quality research to be publicised and published to a global audience. The journal aims to publish papers selected through rigorous double-blind peer review to ensure originality, timeliness, relevance, and readability. In sync with the Journal's vision "to be a respected publication that publishes peer reviewed research articles, as well as review and survey papers contributed by International community of Authors", we have drawn reviewers and editors from institutions and universities across the globe. A double-blind peer review process is conducted to ensure that we retain high standards. At IJACSA, we stand strong because we know that global challenges make way for new innovations, new ways and new talent. International Journal of Advanced Computer Science and Applications publishes carefully refereed research, review and survey papers which offer a significant contribution to the computer science literature, and which are of interest to a wide audience. Coverage extends to all mainstream branches of computer science and related applications.