{"title":"Hcpcs2Vec: Healthcare Procedure Embeddings for Medicare Fraud Prediction","authors":"Justin M. Johnson, T. Khoshgoftaar","doi":"10.1109/CIC50333.2020.00026","DOIUrl":null,"url":null,"abstract":"This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embed-dings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.","PeriodicalId":265435,"journal":{"name":"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 6th International Conference on Collaboration and Internet Computing (CIC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CIC50333.2020.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
This study evaluates semantic healthcare procedure code embeddings on a Medicare fraud classification problem using publicly available big data. Traditionally, categorical Medicare features are one-hot encoded for the purpose of supervised learning. One-hot encoding thousands of unique procedure codes leads to high-dimensional vectors that increase model complexity and fail to capture the inherent relationships between codes. We address these shortcomings by representing procedure codes using low-rank continuous vectors that capture various dimensions of similarity. We leverage publicly available data from the Centers for Medicare and Medicaid Services, with more than 56 million claims records, and train Word2Vec models on sequences of co-occurring codes from the Healthcare Common Procedure Coding System (HCPCS). Continuous-bag-of-words and skip-gram embeddings are trained using a range of embedding and window sizes. The proposed embeddings are empirically evaluated on a Medicare fraud classification problem using the Extreme Gradient Boosting learner. Results are compared to both one-hot encodings and pre-trained embeddings from related works using the area under the receiver operating characteristic curve and geometric mean metrics. Statistical tests are used to show that the proposed embeddings significantly outperform one-hot encodings with 95% confidence. In addition to our empirical analysis, we briefly evaluate the quality of the learned embeddings by exploring nearest neighbors in vector space. To the best of our knowledge, this is the first study to train and evaluate HCPCS procedure embeddings on big Medicare data.
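
The abstract outlines a three-step pipeline: train Word2Vec embeddings on sequences of co-occurring HCPCS codes, use the embeddings as features for an XGBoost fraud classifier, and evaluate with AUC and the geometric mean while inspecting nearest neighbors in embedding space. The sketch below is a minimal, hedged illustration of that pipeline, assuming gensim >= 4.0, xgboost, and scikit-learn; the toy code sequences, placeholder fraud labels, and the mean-pooling feature step are illustrative assumptions, not the authors' actual CMS preprocessing or feature construction.

```python
# Minimal sketch of the pipeline summarized in the abstract. Assumes
# gensim >= 4.0, xgboost, and scikit-learn. Toy data throughout.
import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics import confusion_matrix, roc_auc_score
from xgboost import XGBClassifier

# Each "sentence" is a list of HCPCS codes that co-occur (e.g. billed by the
# same provider in the same context). In the paper these sequences come from
# the CMS claims data; toy sequences stand in here.
hcpcs_sequences = [
    ["99213", "36415", "80053"],
    ["99214", "93000", "36415"],
    ["99213", "80053", "85025"],
    ["99215", "93000"],
]

# Train skip-gram embeddings (sg=1; sg=0 gives continuous bag-of-words).
# The paper sweeps a range of embedding (vector_size) and window sizes.
w2v = Word2Vec(
    sentences=hcpcs_sequences,
    vector_size=100,
    window=5,
    sg=1,
    min_count=1,
    workers=4,
)

# Qualitative check: nearest neighbors of a code in embedding space.
print(w2v.wv.most_similar("99213", topn=3))

# Represent each code sequence as the mean of its code vectors (a simple
# pooling choice assumed for illustration), then classify with XGBoost.
def embed(codes):
    vecs = [w2v.wv[c] for c in codes if c in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.vstack([embed(seq) for seq in hcpcs_sequences])
y = np.array([0, 1, 0, 1])  # placeholder fraud labels

clf = XGBClassifier(n_estimators=50)
clf.fit(X, y)

# Evaluate with AUC and the geometric mean of class-wise recall (G-mean).
probs = clf.predict_proba(X)[:, 1]
preds = (probs > 0.5).astype(int)
auc = roc_auc_score(y, probs)
tn, fp, fn, tp = confusion_matrix(y, preds).ravel()
gmean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))
print(f"AUC={auc:.3f}  G-mean={gmean:.3f}")
```

In practice the classifier would be trained and scored on held-out labeled provider data rather than the training sequences themselves; the in-sample evaluation here only keeps the sketch self-contained.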