Assessing the impact on quality of prediction and inference from balancing in multilevel logistic regression

Carolina Gonzalez-Canas, Gustavo A. Valencia-Zapata, Ana Maria Estrada Gomez, Zachary Hass
Healthcare analytics (New York, N.Y.), volume 6, Article 100359. Published 2024-08-22.
DOI: 10.1016/j.health.2024.100359
URL: https://www.sciencedirect.com/science/article/pii/S2772442524000613

Abstract

The primary goal of this research is to examine how balancing data affects prediction quality and inference in multilevel logistic regression models. Logistic regression is a valuable approach for modeling the binary outcomes expected in health applications. The class imbalance problem, in which one of the two outcome categories occurs much more often than the other, is common in healthcare data, for example when modeling risk factors for rare diseases. The issue is particularly relevant for medical data that combine individual-level measurements with other data sources measured at the level of a geographic region, such as environmental risk factors. In this work, both prediction and model interpretation are of interest. A simulation model is proposed to test the impact of balancing strategies on the multilevel logistic model's parameter estimation, inference, and predictive performance. The simulated data emulate characteristics of a Gestational Diabetes Mellitus (GDM) dataset from Indiana's Medicaid program. Several datasets were simulated with varying levels of complexity in the balance of the outcome variable and the predictors. These datasets exhibited high- or low-frequency occurrences in specific intersections of variables, often called ‘cells.’ The impact of the balancing strategies on prediction and inference was assessed using several techniques, including the equivalence (TOST) test, power analysis, and predictive measures. To the best of our knowledge, this is the first study to explore the impact of balanced samples on coefficient estimation and prediction measures in multilevel logistic modeling, and it finds evidence of benefits from using balanced samples in this context.
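
The workflow the abstract describes (simulate imbalanced multilevel data, balance it, then check whether estimates stay equivalent) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' simulation design: the parameter values, the single binary predictor, random undersampling as the balancing strategy, and the normal-approximation form of the TOST are all hypothetical choices, and the function names are invented here. Fitting the multilevel model itself would need a mixed-model library and is not shown.

```python
import math
import numpy as np

def simulate_multilevel(n_regions=50, n_per_region=200, beta0=-3.0,
                        beta1=0.8, sigma_u=0.5, seed=0):
    """Simulate a rare binary outcome from a random-intercept logistic model.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    region = np.repeat(np.arange(n_regions), n_per_region)
    u = rng.normal(0.0, sigma_u, size=n_regions)    # region-level random intercepts
    x = rng.binomial(1, 0.3, size=region.size)      # individual-level binary predictor
    eta = beta0 + beta1 * x + u[region]             # linear predictor on the logit scale
    p = 1.0 / (1.0 + np.exp(-eta))
    y = rng.binomial(1, p)
    return region, x, y

def undersample_majority(y, seed=0):
    """Return sorted row indices with all minority rows and an
    equal-sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    idx1 = np.flatnonzero(y == 1)
    idx0 = np.flatnonzero(y == 0)
    minority, majority = (idx1, idx0) if idx1.size <= idx0.size else (idx0, idx1)
    keep = rng.choice(majority, size=minority.size, replace=False)
    return np.sort(np.concatenate([minority, keep]))

def tost_z(estimate, reference, se, margin):
    """Two one-sided z-tests for equivalence within +/- margin.

    Returns the TOST p-value; a small value supports equivalence.
    """
    def norm_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    diff = estimate - reference
    p_lower = 1.0 - norm_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = norm_cdf((diff - margin) / se)        # H0: diff >= +margin
    return max(p_lower, p_upper)

region, x, y = simulate_multilevel()
idx = undersample_majority(y)
print("prevalence before balancing:", y.mean())
print("prevalence after balancing: ", y[idx].mean())
```

In a study of this kind, a coefficient estimated on the balanced sample would be passed to `tost_z` together with its standard error and the reference value used in the simulation. Note that undersampling the majority class shifts the intercept of a logistic model, so an equivalence check of this sort is more naturally applied to slope coefficients.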
