David Randahl. arXiv - STAT - Methodology, 2024-09-10. DOI: arxiv-2409.06413.
Citations: 0
This is not normal! (Re-)Evaluating the lower $n$ guidelines for regression analysis
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound on the number of observations required for regression analysis by exploring how distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient for either the dependent or the independent variable to follow a symmetric distribution for the t-values to converge to the t-distribution at sample sizes much smaller than $n=30$. This contradicts previous guidance, which holds that the error term must be normally distributed for this convergence to occur at low $n$. On the other hand, if both the dependent and independent variables are highly skewed, the required sample size is substantially higher. In cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n \geq 30$ rule is too permissive in some cases but overly conservative in others, depending on the underlying distributional characteristics. This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis.
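The core mechanics of such a simulation can be sketched in a few lines of Python. This is a simplified illustration, not the author's actual code: the distribution choices (normal vs. heavily right-skewed lognormal), the sample size, and the replication count are all illustrative assumptions. Under the null of no relationship, independent x and y are drawn, an OLS regression is fit, the slope's t-statistic is recorded, and the empirical rejection rate is compared with the nominal 5% level of the t(n-2) reference distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def slope_t_values(draw_x, draw_y, n, reps=2000):
    """Fit y = b0 + b1*x by OLS with x and y drawn independently
    (so the null b1 = 0 holds) and collect the t-statistic for b1."""
    ts = np.empty(reps)
    for i in range(reps):
        x = draw_x(n)
        y = draw_y(n)
        X = np.column_stack([np.ones(n), x])       # intercept + slope design
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (n - 2)               # residual variance estimate
        cov = s2 * np.linalg.inv(X.T @ X)          # coefficient covariance
        ts[i] = beta[1] / np.sqrt(cov[1, 1])       # slope t-statistic
    return ts

n = 10  # deliberately below the n >= 30 rule of thumb
symmetric = lambda n: rng.normal(size=n)
skewed = lambda n: rng.lognormal(sigma=2.0, size=n)  # heavy right skew

crit = stats.t.ppf(0.975, df=n - 2)  # two-sided 5% critical value

ts_sym = slope_t_values(symmetric, skewed, n)   # one variable symmetric
ts_skew = slope_t_values(skewed, skewed, n)     # both variables skewed

for label, ts in [("symmetric x, skewed y", ts_sym),
                  ("skewed x, skewed y", ts_skew)]:
    size = np.mean(np.abs(ts) > crit)
    print(f"{label}: empirical size of nominal 5% test = {size:.3f}")
```

If the t-values have converged to the t(n-2) reference distribution, the empirical size should sit near the nominal 0.05; large deviations in the both-skewed configuration illustrate the kind of failure the paper documents at small $n$.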