Applying machine learning approaches for predicting obesity risk using US health administrative claims database.

IF 3.7 | Zone 2, Medicine | Q2, ENDOCRINOLOGY & METABOLISM | BMJ Open Diabetes Research & Care | Pub Date: 2024-09-26 | DOI: 10.1136/bmjdrc-2024-004193

Casey Choong, Alan Brnabic, Chanadda Chinthammit, Meena Ravuri, Kendra Terrell, Hong Kan

BMJ Open Diabetes Research & Care, volume 12(5). Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11429277/pdf/

Citation count: 0

Abstract

Introduction: Body mass index (BMI) is inadequately recorded in US administrative claims databases. We aimed to validate the sensitivity and positive predictive value (PPV) of BMI-related diagnosis codes using an electronic medical records (EMR) claims-linked database. Additionally, we applied machine learning (ML) to identify features in US claims databases to predict obesity status.

Research design and methods: This observational, retrospective analysis included 692 119 people ≥18 years of age, with ≥1 BMI reading in MarketScan Explorys Claims-EMR data (January 2013-December 2019). Claims-based obesity status was compared with EMR-based BMI (gold standard) to assess BMI-related diagnosis code sensitivity and PPV. Logistic regression (LR), penalized LR with L1 penalty (Least Absolute Shrinkage and Selection Operator), extreme gradient boosting (XGBoost) and random forest, with features drawn from insurance claims, were trained to predict obesity status (BMI≥30 kg/m²) from EMR as the gold standard. Model performance was compared using several metrics, including the area under the receiver operating characteristic curve. The best-performing model was applied to assess feature importance. Obesity risk scores were computed from the best model generated from the claims database and compared against the BMI recorded in the EMR.
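The validation step described above can be illustrated with a minimal sketch. It assumes each person has an EMR-recorded BMI (the gold standard) and a flag for whether any obesity-related diagnosis code appears in their claims. The `Person` class, the `has_claims_code` field, and the toy records are hypothetical and are not taken from the study data; only the BMI cut-off of 30 kg/m² comes from the abstract.

```python
from dataclasses import dataclass

OBESITY_BMI_THRESHOLD = 30.0  # kg/m^2, the cut-off stated in the abstract


@dataclass
class Person:
    emr_bmi: float         # BMI recorded in the EMR (gold standard)
    has_claims_code: bool  # hypothetical flag: obesity diagnosis code present in claims


def sensitivity_and_ppv(people):
    """Compare claims-based obesity status against EMR-based BMI."""
    tp = fp = fn = 0
    for p in people:
        truly_obese = p.emr_bmi >= OBESITY_BMI_THRESHOLD
        if p.has_claims_code and truly_obese:
            tp += 1  # coded as obese and obese by EMR BMI
        elif p.has_claims_code:
            fp += 1  # coded as obese but EMR BMI below the threshold
        elif truly_obese:
            fn += 1  # obese by EMR BMI but no code in claims
    # Sensitivity: share of EMR-obese people who carry a claims code.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # PPV: share of people with a claims code who are EMR-obese.
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy records for illustration only (not study data).
people = [
    Person(34.2, True), Person(31.0, False), Person(36.5, False), Person(30.0, False),
    Person(27.1, False), Person(24.8, False), Person(29.9, True), Person(41.3, True),
]
sens, ppv = sensitivity_and_ppv(people)
print(f"sensitivity={sens:.3f}, PPV={ppv:.3f}")
```

In this toy set most people with obesity have no code, so sensitivity is low while PPV stays comparatively high. That is the same qualitative pattern the abstract reports for claims codes.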

Results: The PPV of diagnosis codes from claims alone remained high over the study period (85.4-89.2%); sensitivity was low (16.8-44.8%). XGBoost performed the best at predicting obesity with the highest area under the curve (AUC; 79.4%) and the lowest Brier score. The number of obesity diagnoses and obesity diagnoses from inpatient settings were the most important predictors of obesity. XGBoost showed an AUC of 74.1% when trained without an obesity diagnosis.
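The two metrics named above, AUC and Brier score, can be computed as in the following sketch. It is written in plain Python as an illustration and is not the authors' evaluation code. The labels and the two sets of predicted probabilities (`model_a`, `model_b`) are made-up toy values. AUC is computed here as the fraction of (obese, non-obese) pairs in which the obese person receives the higher score, with ties counted as one half.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Toy labels (1 = BMI >= 30 in the EMR) and hypothetical model outputs.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.6, 0.5, 0.5, 0.5, 0.4, 0.6]

auc_a, auc_b = roc_auc(y_true, model_a), roc_auc(y_true, model_b)
brier_a, brier_b = brier_score(y_true, model_a), brier_score(y_true, model_b)
print(f"model_a: AUC={auc_a:.3f}, Brier={brier_a:.3f}")
print(f"model_b: AUC={auc_b:.3f}, Brier={brier_b:.3f}")
```

A higher AUC indicates better ranking of obese above non-obese individuals, and a lower Brier score indicates predicted probabilities closer to the observed outcomes. A model that wins on both, as XGBoost is reported to do in the abstract, is preferred on these two criteria.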

Conclusions: Obesity prevalence is under-reported in claims databases. ML models, with or without explicit obesity diagnoses as features, show promise in improving obesity prediction accuracy compared with obesity codes alone. Improved obesity status prediction may help practitioners and payors estimate the burden of obesity and investigate the potential unmet needs of current treatments.

Source journal

BMJ Open Diabetes Research & Care (Medicine: Endocrinology, Diabetes and Metabolism)

CiteScore: 9.30
Self-citation rate: 2.40%
Articles published: 123
Review time: 18 weeks

Journal description: BMJ Open Diabetes Research & Care is an open access journal committed to publishing high-quality basic and clinical research articles on type 1 and type 2 diabetes and associated complications. Only original content will be accepted, and submissions are subject to rigorous peer review to ensure the publication of high-quality, evidence-based original research articles.