Obtaining the Most Accurate, Explainable Model for Predicting Chronic Obstructive Pulmonary Disease: Triangulation of Multiple Linear Regression and Machine Learning Methods.

JMIR AI (IF 2) | Pub Date: 2024-08-29 | DOI: 10.2196/58455

Arnold Kamis, Nidhi Gadia, Zilin Luo, Shu Xin Ng, Mansi Thumbar


Abstract

Background: Lung disease is a severe problem in the United States. Despite the decreasing rates of cigarette smoking, chronic obstructive pulmonary disease (COPD) continues to be a health burden in the United States. In this paper, we focus on COPD in the United States from 2016 to 2019.

Objective: We gathered a diverse set of non-personally identifiable information from public data sources to better understand and predict COPD rates at the core-based statistical area (CBSA) level in the United States. Our objective was to compare linear models with machine learning models to obtain the most accurate and interpretable model of COPD.

Methods: We integrated non-personally identifiable information from multiple Centers for Disease Control and Prevention sources and used it to analyze COPD with several types of methods. We included cigarette smoking, a well-known contributing factor, and race/ethnicity because health disparities among different races and ethnicities in the United States are also well known. The models also included the air quality index, education, employment, and economic variables. We fitted models with both multiple linear regression and machine learning methods.

Results: The most accurate multiple linear regression model explained 81.1% of the variance, with a mean absolute error of 0.591 and a symmetric mean absolute percentage error of 9.666. The most accurate machine learning model explained 85.7% of the variance, with a mean absolute error of 0.456 and a symmetric mean absolute percentage error of 6.956. Overall, cigarette smoking and household income were the strongest predictor variables. Moderately strong predictors included education level and unemployment level, as well as American Indian or Alaska Native, Black, and Hispanic population percentages, all measured at the CBSA level.
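For readers unfamiliar with the three accuracy measures reported above, the following is a minimal sketch of how they are commonly computed. The abstract does not state which variant of the symmetric mean absolute percentage error the authors used, so the formula here (absolute error divided by the mean of the absolute observed and predicted values, expressed in percent) is an assumption, as is treating "variance explained" as the coefficient of determination. The function names and the four example values are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent (assumed variant: denominator is the
    mean of |actual| and |predicted|)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Skip the term when both values are zero to avoid dividing by zero.
        total += abs(a - p) / denom if denom else 0.0
    return 100 * total / len(actual)


def variance_explained(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical COPD rates (percent) for four areas; not data from the paper.
observed = [6.0, 8.0, 5.0, 10.0]
predicted = [6.5, 7.5, 5.0, 9.0]

print(f"MAE:   {mean_absolute_error(observed, predicted):.3f}")
print(f"sMAPE: {smape(observed, predicted):.3f}")
print(f"R^2:   {variance_explained(observed, predicted):.3f}")
```

Under these definitions, a lower mean absolute error and a lower symmetric mean absolute percentage error both indicate better accuracy, while a higher variance explained does.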

Conclusions: This research highlights the importance of using diverse data sources as well as multiple methods to understand and predict COPD. The most accurate model was a gradient boosted tree, which captured nonlinearities in a model whose accuracy is superior to the best multiple linear regression. Our interpretable models suggest ways that individual predictor variables can be used in tailored interventions aimed at decreasing COPD rates in specific demographic and ethnographic communities. Gaps in understanding the health impacts of poor air quality, particularly in relation to climate change, suggest a need for further research to design interventions and improve public health.
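The claim that a gradient boosted tree can capture nonlinearities that a linear regression misses can be illustrated with a small, self-contained sketch. It is not the authors' model or pipeline: it uses a single predictor, synthetic noise-free data with a threshold effect, and a hand-written gradient boosting loop over one-split trees (stumps) with squared loss. All names, the number of rounds, and the learning rate are assumptions chosen for illustration, and errors are measured on the training data only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_stump(xs, residuals):
    """Best single-split regression tree under squared error.
    Returns (threshold, left_mean, right_mean) for the split x <= threshold."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def fit_boosted_stumps(xs, ys, rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the
    residuals of the current ensemble, then added with a shrinkage factor."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, lr = model
    return base + sum(lr * (lm if x <= t else rm) for t, lm, rm in stumps)


def mae(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


# Synthetic data with a threshold effect; not data from the paper.
xs = [i / 2 for i in range(41)]  # 0.0, 0.5, ..., 20.0
ys = [5.0 if x < 10 else 5.0 + 0.8 * (x - 10) for x in xs]

intercept, slope = fit_line(xs, ys)
linear_mae = mae(ys, [intercept + slope * x for x in xs])

model = fit_boosted_stumps(xs, ys)
boosted_mae = mae(ys, [predict_boosted(model, x) for x in xs])

print(f"linear MAE:  {linear_mae:.3f}")
print(f"boosted MAE: {boosted_mae:.3f}")
```

Because the synthetic outcome is flat below the threshold and rises above it, a single straight line must compromise between the two regimes, whereas the boosted stumps can approximate each regime separately. A real comparison would also need held-out data, since boosted models can overfit.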
