Medicare Fraud Detection using CatBoost
John T. Hancock, T. Khoshgoftaar
2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI 2020), virtual conference, 11-13 August 2020
Published 2020-08-01. DOI: https://doi.org/10.1109/IRI49571.2020.00022

In this study we investigate the performance of CatBoost in identifying Medicare fraud. The Medicare claims data we use as input to CatBoost contain several categorical features, some of which, such as procedure code and provider zip code, have thousands of possible values. Our first contribution is to show how we use CatBoost to eliminate some of the data pre-processing steps that authors of related works take. Our second contribution is to show that CatBoost’s performance, measured by Area Under the Receiver Operating Characteristic Curve (AUC), improves when we include another categorical feature (provider state) as input. We show that CatBoost outperforms XGBoost in Medicare fraud detection with respect to the AUC metric: in our experiments XGBoost obtains a mean AUC of 0.7615 while CatBoost obtains a mean AUC of 0.7851, an improvement that is statistically significant at the 99% confidence level (with a reported p-value of 0). Moreover, when we include the additional categorical feature (healthcare provider state) in our analysis, CatBoost yields a mean AUC of 0.8902, which is also statistically significant at the 99% confidence level (with a reported p-value of 0). Our empirical evidence indicates that CatBoost is a better alternative to XGBoost for Medicare fraud detection, especially when dealing with categorical features.
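
The comparison above rests on AUC values averaged over repeated evaluations. The following is a minimal sketch of how such a mean AUC could be computed, not the authors' code: the function name `roc_auc`, the `runs` list, and all labels and scores are made-up illustrations. It uses the rank-sum identity, under which AUC equals the probability that a randomly chosen positive (fraud) example is scored above a randomly chosen negative one, with ties counted as one half.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1, where 1 marks the positive (fraud) class
    scores: sequence of model scores, higher meaning more likely positive
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    # Walk through groups of tied scores; each group shares its average rank.
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    # U statistic of the positive class, normalised to the range [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical toy data: two evaluation runs, each a (labels, scores) pair.
runs = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
]
aucs = [roc_auc(y, s) for y, s in runs]
mean_auc = mean(aucs)
print(aucs, mean_auc)  # prints [0.75, 1.0] 0.875
```

A mean of this kind, taken over many repeated runs per model, is the quantity that the reported figures (0.7615, 0.7851, 0.8902) summarize; the significance test applied to those runs is not specified in the abstract and is not reproduced here.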