Kuan Jiang, Wenjie Hu, Shu Yang, Xinxing Lai, Xiaohua Zhou
To estimate the average treatment effect in real-world populations, observational studies are typically designed around real-world cohorts. However, even when the study samples from these designs represent the target population, unmeasured confounders can introduce bias. Sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on the strict mathematical assumptions of other existing methods. This article introduces a new approach that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even when overlap is limited due to inclusion/exclusion criteria. Theoretical results and simulations show that this method yields narrower bounds than existing approaches. We also apply this method to both a trial dataset and a real-world drug effectiveness comparison dataset for practical analysis.
{"title":"Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap","authors":"Kuan Jiang, Wenjie Hu, Shu Yang, Xinxing Lai, Xiaohua Zhou","doi":"arxiv-2409.07391","DOIUrl":"https://doi.org/arxiv-2409.07391","url":null,"abstract":"To estimate the average treatment effect in real-world populations,\u0000observational studies are typically designed around real-world cohorts.\u0000However, even when study samples from these designs represent the population,\u0000unmeasured confounders can introduce bias. Sensitivity analysis is often used\u0000to estimate bounds for the average treatment effect without relying on the\u0000strict mathematical assumptions of other existing methods. This article\u0000introduces a new approach that improves sensitivity analysis in observational\u0000studies by incorporating randomized clinical trial data, even with limited\u0000overlap due to inclusion/exclusion criteria. Theoretical proof and simulations\u0000show that this method provides a tighter bound width than existing approaches.\u0000We also apply this method to both a trial dataset and a real-world drug\u0000effectiveness comparison dataset for practical analysis.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the XBX regression model, a continuous mixture of extended-support beta regressions for modeling bounded responses with or without boundary observations. The core building block of the new model is the extended-support beta distribution, which is a censored version of a four-parameter beta distribution with the same exceedance on the left and right of $(0, 1)$. Hence, XBX regression is a direct extension of beta regression. We prove that both beta regression with dispersion effects and heteroscedastic normal regression with censoring at both $0$ and $1$ -- known as the heteroscedastic two-limit tobit model in the econometrics literature -- are special cases of the extended-support beta regression model, depending on whether a single extra parameter is zero or infinity, respectively. To overcome identifiability issues that may arise in estimating the extra parameter due to the similarity of the beta and normal distribution for certain parameter settings, we assume that the additional parameter has an exponential distribution with an unknown mean. The associated marginal likelihood can be conveniently and accurately approximated using a Gauss-Laguerre quadrature rule, resulting in efficient estimation and inference procedures. The new model is used to analyze investment decisions in a behavioral economics experiment, where the occurrence and extent of loss aversion is of interest. In contrast to standard approaches, XBX regression can simultaneously capture the probability of rational behavior as well as the mean amount of loss aversion. Moreover, the effectiveness of the new model is illustrated through extensive numerical comparisons with alternative models.
{"title":"Extended-support beta regression for $[0, 1]$ responses","authors":"Ioannis Kosmidis, Achim Zeileis","doi":"arxiv-2409.07233","DOIUrl":"https://doi.org/arxiv-2409.07233","url":null,"abstract":"We introduce the XBX regression model, a continuous mixture of\u0000extended-support beta regressions for modeling bounded responses with or\u0000without boundary observations. The core building block of the new model is the\u0000extended-support beta distribution, which is a censored version of a\u0000four-parameter beta distribution with the same exceedance on the left and right\u0000of $(0, 1)$. Hence, XBX regression is a direct extension of beta regression. We\u0000prove that both beta regression with dispersion effects and heteroscedastic\u0000normal regression with censoring at both $0$ and $1$ -- known as the\u0000heteroscedastic two-limit tobit model in the econometrics literature -- are\u0000special cases of the extended-support beta regression model, depending on\u0000whether a single extra parameter is zero or infinity, respectively. To overcome\u0000identifiability issues that may arise in estimating the extra parameter due to\u0000the similarity of the beta and normal distribution for certain parameter\u0000settings, we assume that the additional parameter has an exponential\u0000distribution with an unknown mean. The associated marginal likelihood can be\u0000conveniently and accurately approximated using a Gauss-Laguerre quadrature\u0000rule, resulting in efficient estimation and inference procedures. The new model\u0000is used to analyze investment decisions in a behavioral economics experiment,\u0000where the occurrence and extent of loss aversion is of interest. In contrast to\u0000standard approaches, XBX regression can simultaneously capture the probability\u0000of rational behavior as well as the mean amount of loss aversion. Moreover, the\u0000effectiveness of the new model is illustrated through extensive numerical\u0000comparisons with alternative models.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"108 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As part of enhancing the interpretability of machine learning, it is of renewed interest to quantify and infer the predictive importance of certain exposure covariates. Modern scientific studies often collect data from multiple sources with distributional heterogeneity. Thus, measuring and inferring stable associations across multiple environments is crucial for reliable and generalizable decision-making. In this paper, we propose MIMAL, a novel statistical framework for Multi-source stable Importance Measure via Adversarial Learning. MIMAL measures the importance of some exposure variables by maximizing the worst-case predictive reward over the source mixture. Our framework allows various machine learning methods for confounding adjustment and exposure effect characterization. For inferential analysis, the asymptotic normality of our introduced statistic is established under a general machine learning framework that requires no stronger learning accuracy conditions than those for single-source variable importance. Numerical studies with various types of data-generation setups and machine learning implementations are conducted to demonstrate the finite-sample performance of MIMAL. We also illustrate our method through a real-world study of Beijing air pollution in multiple locations.
{"title":"Multi-source Stable Variable Importance Measure via Adversarial Machine Learning","authors":"Zitao Wang, Nian Si, Zijian Guo, Molei Liu","doi":"arxiv-2409.07380","DOIUrl":"https://doi.org/arxiv-2409.07380","url":null,"abstract":"As part of enhancing the interpretability of machine learning, it is of\u0000renewed interest to quantify and infer the predictive importance of certain\u0000exposure covariates. Modern scientific studies often collect data from multiple\u0000sources with distributional heterogeneity. Thus, measuring and inferring stable\u0000associations across multiple environments is crucial in reliable and\u0000generalizable decision-making. In this paper, we propose MIMAL, a novel\u0000statistical framework for Multi-source stable Importance Measure via\u0000Adversarial Learning. MIMAL measures the importance of some exposure variables\u0000by maximizing the worst-case predictive reward over the source mixture. Our\u0000framework allows various machine learning methods for confounding adjustment\u0000and exposure effect characterization. For inferential analysis, the asymptotic\u0000normality of our introduced statistic is established under a general machine\u0000learning framework that requires no stronger learning accuracy conditions than\u0000those for single source variable importance. Numerical studies with various\u0000types of data generation setups and machine learning implementation are\u0000conducted to justify the finite-sample performance of MIMAL. We also illustrate\u0000our method through a real-world study of Beijing air pollution in multiple\u0000locations.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modeling with multi-omics data presents multiple challenges, such as the high dimensionality of the problem ($p \gg n$), the presence of interactions between features, and the need for integration between multiple data sources. We establish an interaction model that allows for the inclusion of multiple sources of data by integrating two existing methods, pliable lasso and cooperative learning. The integrated model is tested both on simulation studies and on real multi-omics datasets for predicting labor onset and cancer treatment response. The results show that the model is effective in modeling multi-source data in various scenarios where interactions are present, both in terms of prediction performance and selection of relevant variables.
{"title":"Integrating Multiple Data Sources with Interactions in Multi-Omics Using Cooperative Learning","authors":"Matteo D'Alessandro, Theophilus Quachie Asenso, Manuela Zucknick","doi":"arxiv-2409.07125","DOIUrl":"https://doi.org/arxiv-2409.07125","url":null,"abstract":"Modeling with multi-omics data presents multiple challenges such as the\u0000high-dimensionality of the problem ($p gg n$), the presence of interactions\u0000between features, and the need for integration between multiple data sources.\u0000We establish an interaction model that allows for the inclusion of multiple\u0000sources of data from the integration of two existing methods, pliable lasso and\u0000cooperative learning. The integrated model is tested both on simulation studies\u0000and on real multi-omics datasets for predicting labor onset and cancer\u0000treatment response. The results show that the model is effective in modeling\u0000multi-source data in various scenarios where interactions are present, both in\u0000terms of prediction performance and selection of relevant variables.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"195 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We develop conservative tests for the mean of a bounded population using data from a stratified sample. The sample may be drawn sequentially, with or without replacement. The tests are "anytime valid," allowing optional stopping and continuation in each stratum. We call this combination of properties sequential, finite-sample, nonparametric validity. The methods express a hypothesis about the population mean as a union of intersection hypotheses describing within-stratum means. They test each intersection hypothesis using independent test supermartingales (TSMs) combined across strata by multiplication. The $P$-value of the global null hypothesis is then the maximum $P$-value of any intersection hypothesis in the union. This approach has three primary moving parts: (i) the rule for deciding which stratum to draw from next to test each intersection null, given the sample so far; (ii) the form of the TSM for each null in each stratum; and (iii) the method of combining evidence across strata. These choices interact. We examine the performance of a variety of rules with differing computational complexity. Approximately optimal methods have a prohibitive computational cost, while naive rules may be inconsistent -- they will never reject for some alternative populations, no matter how large the sample. We present a method that is statistically comparable to optimal methods in examples where optimal methods are computable, but computationally tractable for arbitrarily many strata. In numerical examples its expected sample size is substantially smaller than that of previous methods.
{"title":"Sequential stratified inference for the mean","authors":"Jacob V. Spertus, Mayuri Sridhar, Philip B. Stark","doi":"arxiv-2409.06680","DOIUrl":"https://doi.org/arxiv-2409.06680","url":null,"abstract":"We develop conservative tests for the mean of a bounded population using data\u0000from a stratified sample. The sample may be drawn sequentially, with or without\u0000replacement. The tests are \"anytime valid,\" allowing optional stopping and\u0000continuation in each stratum. We call this combination of properties\u0000sequential, finite-sample, nonparametric validity. The methods express a\u0000hypothesis about the population mean as a union of intersection hypotheses\u0000describing within-stratum means. They test each intersection hypothesis using\u0000independent test supermartingales (TSMs) combined across strata by\u0000multiplication. The $P$-value of the global null hypothesis is then the maximum\u0000$P$-value of any intersection hypothesis in the union. This approach has three\u0000primary moving parts: (i) the rule for deciding which stratum to draw from next\u0000to test each intersection null, given the sample so far; (ii) the form of the\u0000TSM for each null in each stratum; and (iii) the method of combining evidence\u0000across strata. These choices interact. We examine the performance of a variety\u0000of rules with differing computational complexity. Approximately optimal methods\u0000have a prohibitive computational cost, while naive rules may be inconsistent --\u0000they will never reject for some alternative populations, no matter how large\u0000the sample. We present a method that is statistically comparable to optimal\u0000methods in examples where optimal methods are computable, but computationally\u0000tractable for arbitrarily many strata. In numerical examples its expected\u0000sample size is substantially smaller than that of previous methods.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The doubly robust estimator, which models both the propensity score and outcomes, is a popular approach to estimate the average treatment effect in the potential outcomes setting. The primary appeal of this estimator is its theoretical property, wherein the estimator achieves consistency as long as either the propensity score model or the outcome model is correctly specified. In most applications, however, both are misspecified, leading to considerable bias that cannot be checked. In this paper, we propose a Bayesian ensemble approach that synthesizes multiple models for both the propensity score and outcomes, which we call doubly robust Bayesian regression synthesis. Our approach applies Bayesian updating to ensemble model weights that adapt at the unit level, incorporating data heterogeneity, to significantly mitigate misspecification bias. Theoretically, we show that our proposed approach is consistent for the estimation of both the propensity score and outcomes, ensuring that the doubly robust estimator is consistent even if no single model is correctly specified. An efficient algorithm for posterior computation facilitates the characterization of uncertainty regarding the treatment effect. Our proposed approach is compared against standard and state-of-the-art methods through two comprehensive simulation studies, where we find that our approach is superior in all cases. An empirical study on the impact of maternal smoking on birth weight highlights the practical applicability of our proposed method.
{"title":"Ensemble Doubly Robust Bayesian Inference via Regression Synthesis","authors":"Kaoru Babasaki, Shonosuke Sugasawa, Kosaku Takanashi, Kenichiro McAlinn","doi":"arxiv-2409.06288","DOIUrl":"https://doi.org/arxiv-2409.06288","url":null,"abstract":"The doubly robust estimator, which models both the propensity score and\u0000outcomes, is a popular approach to estimate the average treatment effect in the\u0000potential outcome setting. The primary appeal of this estimator is its\u0000theoretical property, wherein the estimator achieves consistency as long as\u0000either the propensity score or outcomes is correctly specified. In most\u0000applications, however, both are misspecified, leading to considerable bias that\u0000cannot be checked. In this paper, we propose a Bayesian ensemble approach that\u0000synthesizes multiple models for both the propensity score and outcomes, which\u0000we call doubly robust Bayesian regression synthesis. Our approach applies\u0000Bayesian updating to the ensemble model weights that adapt at the unit level,\u0000incorporating data heterogeneity, to significantly mitigate misspecification\u0000bias. Theoretically, we show that our proposed approach is consistent regarding\u0000the estimation of both the propensity score and outcomes, ensuring that the\u0000doubly robust estimator is consistent, even if no single model is correctly\u0000specified. An efficient algorithm for posterior computation facilitates the\u0000characterization of uncertainty regarding the treatment effect. Our proposed\u0000approach is compared against standard and state-of-the-art methods through two\u0000comprehensive simulation studies, where we find that our approach is superior\u0000in all cases. An empirical study on the impact of maternal smoking on birth\u0000weight highlights the practical applicability of our proposed method.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many real-world networks, relationships often go beyond simple dyadic presence or absence; they can be positive, like friendship, alliance, and mutualism, or negative, characterized by enmity, disputes, and competition. To understand the formation mechanism of such signed networks, social balance theory sheds light on the dynamics of positive and negative connections. In particular, it formalizes the proverbs "a friend of my friend is my friend" and "an enemy of my enemy is my friend". In this work, we propose a nonparametric inference approach for assessing empirical evidence for balance theory in real-world signed networks. We first characterize the generating process of signed networks with node exchangeability and propose a nonparametric sparse signed graphon model. Under this model, we construct confidence intervals for the population parameters associated with balance theory and establish their theoretical validity. Our inference procedure is as computationally efficient as a simple normal approximation but offers higher-order accuracy. By applying our method, we find strong real-world evidence for balance theory in signed networks across various domains, extending its applicability beyond social psychology.
{"title":"Nonparametric Inference for Balance in Signed Networks","authors":"Xuyang Chen, Yinjie Wang, Weijing Tang","doi":"arxiv-2409.06172","DOIUrl":"https://doi.org/arxiv-2409.06172","url":null,"abstract":"In many real-world networks, relationships often go beyond simple dyadic\u0000presence or absence; they can be positive, like friendship, alliance, and\u0000mutualism, or negative, characterized by enmity, disputes, and competition. To\u0000understand the formation mechanism of such signed networks, the social balance\u0000theory sheds light on the dynamics of positive and negative connections. In\u0000particular, it characterizes the proverbs, \"a friend of my friend is my friend\"\u0000and \"an enemy of my enemy is my friend\". In this work, we propose a\u0000nonparametric inference approach for assessing empirical evidence for the\u0000balance theory in real-world signed networks. We first characterize the\u0000generating process of signed networks with node exchangeability and propose a\u0000nonparametric sparse signed graphon model. Under this model, we construct\u0000confidence intervals for the population parameters associated with balance\u0000theory and establish their theoretical validity. Our inference procedure is as\u0000computationally efficient as a simple normal approximation but offers\u0000higher-order accuracy. By applying our method, we find strong real-world\u0000evidence for balance theory in signed networks across various domains,\u0000extending its applicability beyond social psychology.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
{"title":"Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach","authors":"Yunhui Qi, Xinyi Wang, Li-Xuan Qin","doi":"arxiv-2409.06180","DOIUrl":"https://doi.org/arxiv-2409.06180","url":null,"abstract":"Accurate sample classification using transcriptomics data is crucial for\u0000advancing personalized medicine. Achieving this goal necessitates determining a\u0000suitable sample size that ensures adequate statistical power without undue\u0000resource allocation. Current sample size calculation methods rely on\u0000assumptions and algorithms that may not align with supervised machine learning\u0000techniques for sample classification. Addressing this critical methodological\u0000gap, we present a novel computational approach that establishes the\u0000power-versus-sample-size relationship by employing a data augmentation strategy\u0000followed by fitting a learning curve. We comprehensively evaluated its\u0000performance for microRNA and RNA sequencing data, considering diverse data\u0000characteristics and algorithm configurations, based on a spectrum of evaluation\u0000metrics. To foster accessibility and reproducibility, the Python and R code for\u0000implementing our approach is available on GitHub. Its deployment will\u0000significantly facilitate the adoption of machine learning in transcriptomics\u0000studies and accelerate their translation into clinically useful classifiers for\u0000personalized treatment.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current theory of global sensitivity analysis, based on a nonlinear functional ANOVA decomposition of the random output, is limited in scope (for instance, the analysis is restricted to the output's variance and the inputs have to be mutually independent) and leads to sensitivity indices whose interpretation is not fully clear, especially for interaction effects. Alternatively, sensitivity indices built for arbitrary user-defined importance measures have been proposed, but a theory to define interactions in a systematic fashion and/or establish a decomposition of the total importance measure is still missing. It is shown that these important problems are solved all at once by adopting a new paradigm. By partitioning the inputs into those causing the change in the output and those which do not, arbitrary user-defined variability measures are identified with the outcomes of a factorial experiment at two levels, leading to all factorial effects without assuming any functional decomposition. To link various well-known sensitivity indices of the literature (Sobol indices and Shapley effects), weighted factorial effects are studied and utilized.
{"title":"A new paradigm for global sensitivity analysis","authors":"Gildas MazoMaIAGE","doi":"arxiv-2409.06271","DOIUrl":"https://doi.org/arxiv-2409.06271","url":null,"abstract":"<div><p>Current theory of global sensitivity analysis, based on a nonlinear\u0000functional ANOVA decomposition of the random output, is limited in scope-for\u0000instance, the analysis is limited to the output's variance and the inputs have\u0000to be mutually independent-and leads to sensitivity indices the interpretation\u0000of which is not fully clear, especially interaction effects. Alternatively,\u0000sensitivity indices built for arbitrary user-defined importance measures have\u0000been proposed but a theory to define interactions in a systematic fashion\u0000and/or establish a decomposition of the total importance measure is still\u0000missing. It is shown that these important problems are solved all at once by\u0000adopting a new paradigm. By partitioning the inputs into those causing the\u0000change in the output and those which do not, arbitrary user-defined variability\u0000measures are identified with the outcomes of a factorial experiment at two\u0000levels, leading to all factorial effects without assuming any functional\u0000decomposition. To link various well-known sensitivity indices of the literature\u0000(Sobol indices and Shapley effects), weighted factorial effects are studied and\u0000utilized.</p></div>","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound for the number of observations required for regression analysis by exploring how different distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient for either the dependent or the independent variable to follow a symmetric distribution for the t-values to converge to the t-distribution at much smaller sample sizes than $n = 30$. This is contrary to previous guidance, which suggests that the error term needs to be normally distributed for this convergence to happen at low $n$. On the other hand, if both the dependent and independent variables are highly skewed, the required sample size is substantially higher. In cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n \geq 30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics. This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis.
{"title":"This is not normal! (Re-) Evaluating the lower $n$ guildelines for regression analysis","authors":"David Randahl","doi":"arxiv-2409.06413","DOIUrl":"https://doi.org/arxiv-2409.06413","url":null,"abstract":"The commonly cited rule of thumb for regression analysis, which suggests that\u0000a sample size of $n geq 30$ is sufficient to ensure valid inferences, is\u0000frequently referenced but rarely scrutinized. This research note evaluates the\u0000lower bound for the number of observations required for regression analysis by\u0000exploring how different distributional characteristics, such as skewness and\u0000kurtosis, influence the convergence of t-values to the t-distribution in linear\u0000regression models. Through an extensive simulation study involving over 22\u0000billion regression models, this paper examines a range of symmetric,\u0000platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000.\u0000The results reveal that it is sufficient that either the dependent or\u0000independent variable follow a symmetric distribution for the t-values to\u0000converge to the t-distribution at much smaller sample sizes than $n=30$. This\u0000is contrary to previous guidance which suggests that the error term needs to be\u0000normally distributed for this convergence to happen at low $n$. On the other\u0000hand, if both dependent and independent variables are highly skewed the\u0000required sample size is substantially higher. In cases of extreme skewness,\u0000even sample sizes of 10,000 do not ensure convergence. These findings suggest\u0000that the $ngeq30$ rule is too permissive in certain cases but overly\u0000conservative in others, depending on the underlying distributional\u0000characteristics. This study offers revised guidelines for determining the\u0000minimum sample size necessary for valid regression analysis.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142196633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}