Data-Mining Homogeneous Subgroups in Multiple Regression When Heteroscedasticity, Multicollinearity, and Missing Variables Confound Predictor Effects

Advances in Data Science and Adaptive Analysis (IF 0.5, Q4, Mathematics, Interdisciplinary Applications) · Pub Date: 2020-09-05 · DOI: 10.1142/s2424922x20410041
R. Francoeur
{"title":"Data-Mining Homogeneous Subgroups in Multiple Regression When Heteroscedasticity, Multicollinearity, and Missing Variables Confound Predictor Effects","authors":"R. Francoeur","doi":"10.1142/s2424922x20410041","DOIUrl":null,"url":null,"abstract":"Multiple regression is not reliable to recover predictor slopes within homogeneous subgroups from heterogeneous samples. In contrast to Monte Carlo analysis, which assigns completely to the first-specified predictor the variation it shares with the remaining predictors, multiple regression does not assign this shared variation to any predictor, and it is sequestered in the residual term. This unassigned and confounding variation may correlate with specified predictors, lead to heteroscedasticity, and distort multicollinearity. I develop and test an iterative, sequential algorithm to estimate a two-part series of weighted least-square (WLS) multiple regressions for recovering the Monte Carlo predictor slopes in three homogeneous subgroups (each generated with 500 observations) of a heterogeneous sample [Formula: see text]. Each variable has a different nonnormal distribution. The algorithm mines each subgroup and then adjusts bias within it from 1) heteroscedasticity related to one, some, or all specified predictors and 2) “nonessential” multicollinearity. It recovers all three specified predictor slopes across the three subgroups in two scenarios, with one influenced also by two unspecified predictors. The algorithm extends adaptive analysis to discover and appraise patterns in field research and machine learning when predictors are inter-correlated, and even unspecified, in order to reveal unbiased outcome clusters in heterogeneous and homogeneous samples with nonnormal outcome and predictors.","PeriodicalId":47145,"journal":{"name":"Advances in Data Science and Adaptive Analysis","volume":"24 1","pages":"2041004:1-2041004:59"},"PeriodicalIF":0.5000,"publicationDate":"2020-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Data Science and Adaptive Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s2424922x20410041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
Citations: 0

Abstract

Multiple regression is not reliable for recovering predictor slopes within homogeneous subgroups of heterogeneous samples. In contrast to Monte Carlo analysis, which assigns entirely to the first-specified predictor the variation it shares with the remaining predictors, multiple regression assigns this shared variation to no predictor; it is sequestered in the residual term. This unassigned, confounding variation may correlate with specified predictors, lead to heteroscedasticity, and distort multicollinearity. I develop and test an iterative, sequential algorithm that estimates a two-part series of weighted least-squares (WLS) multiple regressions to recover the Monte Carlo predictor slopes in three homogeneous subgroups (each generated with 500 observations) of a heterogeneous sample [Formula: see text]. Each variable has a different nonnormal distribution. The algorithm mines each subgroup and then adjusts bias within it from (1) heteroscedasticity related to one, some, or all specified predictors and (2) "nonessential" multicollinearity. It recovers all three specified predictor slopes across the three subgroups in two scenarios, one of which is also influenced by two unspecified predictors. The algorithm extends adaptive analysis to discover and appraise patterns in field research and machine learning when predictors are inter-correlated, or even unspecified, in order to reveal unbiased outcome clusters in heterogeneous and homogeneous samples with a nonnormal outcome and nonnormal predictors.
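As a rough illustration of the kind of procedure the abstract describes, the sketch below shows a generic iterative WLS loop: predictors are mean-centered (which removes "nonessential" multicollinearity arising from uncentered scaling), an initial unweighted fit is made, the residual variance is modeled as a function of the predictors, and the regression is refit with inverse-variance weights. This is not the author's published algorithm; the variance model, variable names, and simulated data are illustrative assumptions only.

```python
# Minimal sketch (assumed, generic approach -- not the paper's algorithm):
# iterative WLS with a heteroscedasticity-based reweighting rule and
# mean-centering of predictors to reduce "nonessential" multicollinearity.
import numpy as np
import statsmodels.api as sm

def iterative_wls(y, X, n_iter=5):
    """Fit OLS, model the residual variance on the predictors,
    then refit by WLS with the inverse fitted variance as weights."""
    Xc = X - X.mean(axis=0)          # centering removes nonessential multicollinearity
    Xd = sm.add_constant(Xc)
    fit = sm.OLS(y, Xd).fit()        # initial unweighted fit
    for _ in range(n_iter):
        # Model the log squared residuals on the predictors (a Harvey-style
        # variance function); this is an assumed choice of weighting rule.
        log_e2 = np.log(fit.resid ** 2 + 1e-12)
        var_fit = sm.OLS(log_e2, Xd).fit()
        weights = 1.0 / np.exp(var_fit.fittedvalues)
        fit = sm.WLS(y, Xd, weights=weights).fit()
    return fit

# Usage on simulated data mimicking one homogeneous subgroup (500 observations,
# nonnormal predictors, error variance tied to the first predictor):
rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(500, 3))
e = rng.normal(0, 1 + 0.5 * X[:, 0], size=500)
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + e
print(iterative_wls(y, X).params)   # slopes should approach 0.5, -1.0, 2.0
```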
Source journal: Advances in Data Science and Adaptive Analysis (Mathematics, Interdisciplinary Applications)
Self-citation rate: 0.00%
Articles published per year: 13
Latest articles from this journal:
Assessment of Mars Analog Habitation Plans Using Network Analysis Methodologies
A Novel Genetic-Inspired Binary Firefly Algorithm for Feature Selection in the Prediction of Cervical Cancer
Big Data Analytics for Predictive System Maintenance Using Machine Learning Models
Data Mining for Estimating the Impact of Physical Activity Levels on the Health-Related Well-Being
A Novel Autoencoder Deep Architecture for Detecting the Outlier in Heterogeneous Data Sets