A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities

Informatics (IF 3.4; JCR Q2, Computer Science, Interdisciplinary Applications) | Pub Date: 2023-10-11 | DOI: 10.3390/informatics10040077
Fan Zhang, Melissa Petersen, Leigh Johnson, James Hall, Raymond F. Palmer, Sid E. O’Bryant
Citations: 0

Abstract

A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities
The Health and Aging Brain Study–Health Disparities (HABS–HD) project seeks to understand the biological, social, and environmental factors that impact brain aging among diverse communities. A common issue for HABS–HD is missing data. Accurate machine learning (ML) is difficult to achieve when data contain missing values. Therefore, developing a new imputation methodology has become an urgent task for HABS–HD. The three missing-data mechanisms, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), each call for a distinct imputation approach. Several popular methods for handling missing data, including listwise deletion, minimum and mean imputation, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may result in biased outcomes and reduced statistical power when applied to downstream analyses such as testing hypotheses related to clinical variables or using machine learning to predict Alzheimer’s disease (AD) or mild cognitive impairment (MCI). Moreover, these commonly used techniques can produce unreliable estimates of missing values if they do not account for the missingness mechanism, or if the imputation method is inconsistent with the missing-data mechanism in HABS–HD. Therefore, we proposed a three-step workflow to handle missing data in HABS–HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS–HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted them to multiple imputation using simple averaging. Lastly, we evaluated and compared MLMI with other common methods. Our results showed that the three-step workflow worked well for handling missing values in HABS–HD and that the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and changes in distribution and correlation. The choice of missing-data handling method has a significant impact on the accompanying statistical analyses of HABS–HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer’s disease models. They can also be applied to other disease data analyses.
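The three-step workflow described in the abstract (missing data evaluation, imputation, imputation evaluation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on several assumptions: the data are synthetic, missingness is simulated as MCAR, a least-squares line stands in for the SVM, RF, XGB, and GLMNET learners, and the multiple imputations come from bootstrap resamples whose predictions are combined by simple averaging. All function names are hypothetical.

```python
import random
import statistics


def make_mcar(values, rate, rng):
    """Simulate MCAR: drop each value independently with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def fit_line(xs, ys):
    """Least-squares fit y = a + b*x (a stand-in for an ML learner)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def impute_averaged(xs, ys_missing, n_imputations, rng):
    """Fit the learner on bootstrap resamples of the observed rows and
    fill each missing entry with the average of the resulting predictions."""
    observed = [(x, y) for x, y in zip(xs, ys_missing) if y is not None]
    fits = []
    for _ in range(n_imputations):
        sample = [rng.choice(observed) for _ in observed]
        fits.append(fit_line([x for x, _ in sample], [y for _, y in sample]))
    imputed = []
    for x, y in zip(xs, ys_missing):
        if y is None:
            y = statistics.fmean([a + b * x for a, b in fits])
        imputed.append(y)
    return imputed


def impute_mean(ys_missing):
    """Baseline: replace every missing entry with the observed mean."""
    mean = statistics.fmean([y for y in ys_missing if y is not None])
    return [mean if y is None else y for y in ys_missing]


def rmse_on_missing(truth, ys_missing, imputed):
    """Root mean squared error, computed only on the masked entries."""
    errs = [(t - i) ** 2
            for t, masked, i in zip(truth, ys_missing, imputed)
            if masked is None]
    return (sum(errs) / len(errs)) ** 0.5


rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
truth = [2.0 + 1.5 * x + rng.gauss(0, 1) for x in xs]
ys_missing = make_mcar(truth, 0.2, rng)

# Step 1: missing data evaluation (here, just the missing rate).
missing_rate = sum(y is None for y in ys_missing) / len(ys_missing)

# Step 2: imputation (model-based with averaging, plus a mean baseline).
ml_imputed = impute_averaged(xs, ys_missing, n_imputations=5, rng=rng)
mean_imputed = impute_mean(ys_missing)

# Step 3: imputation evaluation against the known ground truth.
rmse_ml = rmse_on_missing(truth, ys_missing, ml_imputed)
rmse_mean = rmse_on_missing(truth, ys_missing, mean_imputed)
print(f"missing rate: {missing_rate:.2f}")
print(f"RMSE averaged model imputation: {rmse_ml:.3f}")
print(f"RMSE mean imputation: {rmse_mean:.3f}")
```

Because the synthetic outcome depends strongly on the predictor, the model-based imputation should recover the masked values more closely than mean imputation. On real data such as HABS–HD the ground truth is not available, so evaluation would need masked held-out values or the distribution and correlation comparisons mentioned in the abstract.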
Source journal: Informatics (Social Sciences-Communication)
CiteScore: 6.60
Self-citation rate: 6.50%
Articles published: 88
Review time: 6 weeks