{"title":"How should educational research respond to the replication “crisis” in the social sciences? Reflections on the papers in the Special Issue","authors":"D. Wiliam","doi":"10.1080/13803611.2021.2022309","DOIUrl":null,"url":null,"abstract":"For anyone who understands the logic of null-hypothesis significance testing, the so-called “replication crisis” in the behavioural sciences (Bryan et al., 2021) would not have come as much of a surprise. Since the pioneering work of Carlo Bonferroni (1935) – and subsequent work in the 1950s by Henry Scheffé (1953), John Tukey (1953/1994), and Olive Jean Dunn (1961) – statisticians have repeatedly pointed out the logically obvious fact that the probability of making a Type I error (mistakenly rejecting the null hypothesis) increases when multiple comparisons are made. And yet, studies in leading psychology and education journals commonly present dozens if not hundreds of comparisons of means, correlations, or other statistics, and then go on to claim that any statistic that has a probability of less than 0.05 is “significant”. However, as Gelman and Loken (2013) point out, even when researchers do not engage in such “fishing expeditions”, if decisions about the analysis are made after the data are collected – “hypothesizing after results are known” or “HARKing” (Kerr, 1998) – then the probability of Type 1 errors is increased. At each stage in the analysis, the researcher is presented with many choices – what Gelman and Loken call “the garden of forking paths” after a short story by Argentinian author Jorge Luis (Borges, 1941/1964) – that can profoundly influence the results obtained. Some of these, such as cleaning data, or eliminating outliers, seem innocent, but nevertheless, because these decisions are taken after the results are seen, they are inconsistent with the assumptions of nullhypothesis significance testing. Other, more egregious, examples include outcome switching, collecting additional data, or changing the analytical approach when the desired level of statistical significance is not reached. A good example of how these issues play out in practice is provided by Bokhove (2022) in his replication of a study on gender differences in computer literacy, where he found that different, reasonable, analytical choices lead to very different conclusions.","PeriodicalId":47025,"journal":{"name":"Educational Research and Evaluation","volume":"27 1","pages":"208 - 214"},"PeriodicalIF":2.3000,"publicationDate":"2022-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Research and Evaluation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/13803611.2021.2022309","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Abstract
For anyone who understands the logic of null-hypothesis significance testing, the so-called “replication crisis” in the behavioural sciences (Bryan et al., 2021) would not have come as much of a surprise. Since the pioneering work of Carlo Bonferroni (1935) – and subsequent work in the 1950s by Henry Scheffé (1953), John Tukey (1953/1994), and Olive Jean Dunn (1961) – statisticians have repeatedly pointed out the logically obvious fact that the probability of making a Type I error (mistakenly rejecting the null hypothesis) increases when multiple comparisons are made. And yet, studies in leading psychology and education journals commonly present dozens if not hundreds of comparisons of means, correlations, or other statistics, and then go on to claim that any comparison with a p value below 0.05 is “significant”. However, as Gelman and Loken (2013) point out, even when researchers do not engage in such “fishing expeditions”, if decisions about the analysis are made after the data are collected – “hypothesizing after the results are known”, or “HARKing” (Kerr, 1998) – then the probability of Type I errors is increased. At each stage in the analysis, the researcher is presented with many choices – what Gelman and Loken call “the garden of forking paths”, after a short story by the Argentinian author Jorge Luis Borges (1941/1964) – that can profoundly influence the results obtained. Some of these choices, such as cleaning data or eliminating outliers, seem innocent; nevertheless, because they are taken after the results are seen, they are inconsistent with the assumptions of null-hypothesis significance testing. Other, more egregious examples include outcome switching, collecting additional data, or changing the analytical approach when the desired level of statistical significance is not reached. A good example of how these issues play out in practice is provided by Bokhove (2022) in his replication of a study on gender differences in computer literacy, where he found that different, reasonable analytical choices lead to very different conclusions.
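To make the multiple-comparisons point concrete: with m independent tests each conducted at α = 0.05, the probability of at least one Type I error is 1 − (1 − α)^m, which already exceeds 0.64 at m = 20. The sketch below is my own illustration of this arithmetic, not material from Wiliam's paper; the variable names and the simulation setup are assumptions chosen for clarity.

```python
# Minimal sketch (illustrative, not from the paper): how the familywise
# Type I error rate grows with the number of independent comparisons,
# each tested at alpha = 0.05, when the null hypothesis is true for all.
import numpy as np
from scipy import stats

alpha = 0.05
rng = np.random.default_rng(0)

for m in (1, 5, 20, 100):
    # Analytic familywise error rate for m independent tests:
    # P(at least one false positive) = 1 - (1 - alpha)^m
    analytic = 1 - (1 - alpha) ** m

    # Monte Carlo check: each trial draws m null samples (no true effect)
    # and counts the trial as a false positive if ANY t-test is "significant".
    trials = 2000
    false_positives = 0
    for _ in range(trials):
        data = rng.normal(size=(m, 30))              # 30 observations per test
        p = stats.ttest_1samp(data, 0.0, axis=1).pvalue
        false_positives += np.any(p < alpha)

    print(f"m={m:4d}  analytic={analytic:.3f}  simulated={false_positives / trials:.3f}")
```

The corrections proposed by the statisticians cited above attack exactly this inflation; Bonferroni's adjustment, for instance, tests each of the m comparisons at α/m so that the familywise rate stays at or below α.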
Journal description:
International, comparative and multidisciplinary in scope, Educational Research and Evaluation (ERE) publishes original, peer-reviewed academic articles dealing with research on issues of worldwide relevance in educational practice. The aim of the journal is to increase understanding of learning in pre-primary, primary, high school, college, university and adult education, and to contribute to the improvement of educational processes and outcomes. The journal seeks to promote cross-national and international comparative educational research by publishing findings relevant to the scholarly community, as well as to practitioners and others interested in education. The scope of the journal is deliberately broad in terms of both topics covered and disciplinary perspective.