Data-Mining Homogeneous Subgroups in Multiple Regression When Heteroscedasticity, Multicollinearity, and Missing Variables Confound Predictor Effects

Advances in Data Science and Adaptive Analysis (IF 0.5, Q4, Mathematics, Interdisciplinary Applications) · Pub Date: 2020-09-05 · DOI: 10.1142/s2424922x20410041
R. Francoeur
{"title":"Data-Mining Homogeneous Subgroups in Multiple Regression When Heteroscedasticity, Multicollinearity, and Missing Variables Confound Predictor Effects","authors":"R. Francoeur","doi":"10.1142/s2424922x20410041","DOIUrl":null,"url":null,"abstract":"Multiple regression is not reliable to recover predictor slopes within homogeneous subgroups from heterogeneous samples. In contrast to Monte Carlo analysis, which assigns completely to the first-specified predictor the variation it shares with the remaining predictors, multiple regression does not assign this shared variation to any predictor, and it is sequestered in the residual term. This unassigned and confounding variation may correlate with specified predictors, lead to heteroscedasticity, and distort multicollinearity. I develop and test an iterative, sequential algorithm to estimate a two-part series of weighted least-square (WLS) multiple regressions for recovering the Monte Carlo predictor slopes in three homogeneous subgroups (each generated with 500 observations) of a heterogeneous sample [Formula: see text]. Each variable has a different nonnormal distribution. The algorithm mines each subgroup and then adjusts bias within it from 1) heteroscedasticity related to one, some, or all specified predictors and 2) “nonessential” multicollinearity. It recovers all three specified predictor slopes across the three subgroups in two scenarios, with one influenced also by two unspecified predictors. The algorithm extends adaptive analysis to discover and appraise patterns in field research and machine learning when predictors are inter-correlated, and even unspecified, in order to reveal unbiased outcome clusters in heterogeneous and homogeneous samples with nonnormal outcome and predictors.","PeriodicalId":47145,"journal":{"name":"Advances in Data Science and Adaptive Analysis","volume":"24 1","pages":"2041004:1-2041004:59"},"PeriodicalIF":0.5000,"publicationDate":"2020-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Data Science and Adaptive Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s2424922x20410041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Multiple regression is not reliable for recovering predictor slopes within homogeneous subgroups of heterogeneous samples. In contrast to Monte Carlo analysis, which assigns entirely to the first-specified predictor the variation it shares with the remaining predictors, multiple regression assigns this shared variation to no predictor; it is sequestered in the residual term. This unassigned, confounding variation may correlate with specified predictors, lead to heteroscedasticity, and distort multicollinearity. I develop and test an iterative, sequential algorithm that estimates a two-part series of weighted least-squares (WLS) multiple regressions to recover the Monte Carlo predictor slopes in three homogeneous subgroups (each generated with 500 observations) of a heterogeneous sample [Formula: see text]. Each variable has a different nonnormal distribution. The algorithm mines each subgroup and then adjusts bias within it from (1) heteroscedasticity related to one, some, or all specified predictors and (2) "nonessential" multicollinearity. It recovers all three specified predictor slopes across the three subgroups in two scenarios, one of which is also influenced by two unspecified predictors. The algorithm extends adaptive analysis to discover and appraise patterns in field research and machine learning when predictors are inter-correlated, or even unspecified, in order to reveal unbiased outcome clusters in heterogeneous and homogeneous samples with a nonnormal outcome and nonnormal predictors.
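As a rough illustration of the kind of procedure the abstract describes, the sketch below shows a generic iterative WLS loop: predictors are mean-centered (which removes "nonessential" multicollinearity arising from uncentered scaling), an initial unweighted fit is made, the residual variance is modeled as a function of the predictors, and the regression is refit with inverse-variance weights. This is not the author's published algorithm; the variance model, variable names, and simulated data are illustrative assumptions only.

```python
# Minimal sketch (assumed, generic approach -- not the paper's algorithm):
# iterative WLS with a heteroscedasticity-based reweighting rule and
# mean-centering of predictors to reduce "nonessential" multicollinearity.
import numpy as np
import statsmodels.api as sm

def iterative_wls(y, X, n_iter=5):
    """Fit OLS, model the residual variance on the predictors,
    then refit by WLS with the inverse fitted variance as weights."""
    Xc = X - X.mean(axis=0)          # centering removes nonessential multicollinearity
    Xd = sm.add_constant(Xc)
    fit = sm.OLS(y, Xd).fit()        # initial unweighted fit
    for _ in range(n_iter):
        # Model the log squared residuals on the predictors (a Harvey-style
        # variance function); this is an assumed choice of weighting rule.
        log_e2 = np.log(fit.resid ** 2 + 1e-12)
        var_fit = sm.OLS(log_e2, Xd).fit()
        weights = 1.0 / np.exp(var_fit.fittedvalues)
        fit = sm.WLS(y, Xd, weights=weights).fit()
    return fit

# Usage on simulated data mimicking one homogeneous subgroup (500 observations,
# nonnormal predictors, error variance tied to the first predictor):
rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(500, 3))
e = rng.normal(0, 1 + 0.5 * X[:, 0], size=500)
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + e
print(iterative_wls(y, X).params)   # slopes should approach 0.5, -1.0, 2.0
```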
Source journal: Advances in Data Science and Adaptive Analysis (Mathematics, Interdisciplinary Applications)
Self-citation rate: 0.00%
Articles published per year: 13
Latest articles from this journal:
Assessment of Mars Analog Habitation Plans Using Network Analysis Methodologies
A Novel Genetic-Inspired Binary Firefly Algorithm for Feature Selection in the Prediction of Cervical Cancer
Big Data Analytics for Predictive System Maintenance Using Machine Learning Models
Data Mining for Estimating the Impact of Physical Activity Levels on the Health-Related Well-Being
A Novel Autoencoder Deep Architecture for Detecting the Outlier in Heterogeneous Data Sets