{"title":"How should educational research respond to the replication “crisis” in the social sciences? Reflections on the papers in the Special Issue","authors":"D. Wiliam","doi":"10.1080/13803611.2021.2022309","DOIUrl":null,"url":null,"abstract":"For anyone who understands the logic of null-hypothesis significance testing, the so-called “replication crisis” in the behavioural sciences (Bryan et al., 2021) would not have come as much of a surprise. Since the pioneering work of Carlo Bonferroni (1935) – and subsequent work in the 1950s by Henry Scheffé (1953), John Tukey (1953/1994), and Olive Jean Dunn (1961) – statisticians have repeatedly pointed out the logically obvious fact that the probability of making a Type I error (mistakenly rejecting the null hypothesis) increases when multiple comparisons are made. And yet, studies in leading psychology and education journals commonly present dozens if not hundreds of comparisons of means, correlations, or other statistics, and then go on to claim that any statistic that has a probability of less than 0.05 is “significant”. However, as Gelman and Loken (2013) point out, even when researchers do not engage in such “fishing expeditions”, if decisions about the analysis are made after the data are collected – “hypothesizing after results are known” or “HARKing” (Kerr, 1998) – then the probability of Type 1 errors is increased. At each stage in the analysis, the researcher is presented with many choices – what Gelman and Loken call “the garden of forking paths” after a short story by Argentinian author Jorge Luis (Borges, 1941/1964) – that can profoundly influence the results obtained. Some of these, such as cleaning data, or eliminating outliers, seem innocent, but nevertheless, because these decisions are taken after the results are seen, they are inconsistent with the assumptions of nullhypothesis significance testing. Other, more egregious, examples include outcome switching, collecting additional data, or changing the analytical approach when the desired level of statistical significance is not reached. A good example of how these issues play out in practice is provided by Bokhove (2022) in his replication of a study on gender differences in computer literacy, where he found that different, reasonable, analytical choices lead to very different conclusions.","PeriodicalId":47025,"journal":{"name":"Educational Research and Evaluation","volume":"27 1","pages":"208 - 214"},"PeriodicalIF":2.3000,"publicationDate":"2022-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational Research and Evaluation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/13803611.2021.2022309","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Abstract
For anyone who understands the logic of null-hypothesis significance testing, the so-called “replication crisis” in the behavioural sciences (Bryan et al., 2021) would not have come as much of a surprise. Since the pioneering work of Carlo Bonferroni (1935) – and subsequent work in the 1950s by Henry Scheffé (1953), John Tukey (1953/1994), and Olive Jean Dunn (1961) – statisticians have repeatedly pointed out the logically obvious fact that the probability of making a Type I error (mistakenly rejecting the null hypothesis) increases when multiple comparisons are made. And yet, studies in leading psychology and education journals commonly present dozens if not hundreds of comparisons of means, correlations, or other statistics, and then go on to claim that any comparison with a p value below 0.05 is “significant”. However, as Gelman and Loken (2013) point out, even when researchers do not engage in such “fishing expeditions”, if decisions about the analysis are made after the data are collected – “hypothesizing after the results are known”, or “HARKing” (Kerr, 1998) – then the probability of Type I errors is increased. At each stage in the analysis, the researcher is presented with many choices – what Gelman and Loken call “the garden of forking paths”, after a short story by the Argentinian author Jorge Luis Borges (1941/1964) – that can profoundly influence the results obtained. Some of these choices, such as cleaning data or eliminating outliers, seem innocent; nevertheless, because they are taken after the results are seen, they are inconsistent with the assumptions of null-hypothesis significance testing. Other, more egregious examples include outcome switching, collecting additional data, or changing the analytical approach when the desired level of statistical significance is not reached. A good example of how these issues play out in practice is provided by Bokhove (2022) in his replication of a study on gender differences in computer literacy, where he found that different, reasonable analytical choices lead to very different conclusions.
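To make the multiple-comparisons point concrete: with m independent tests each conducted at α = 0.05, the probability of at least one Type I error is 1 − (1 − α)^m, which already exceeds 0.64 at m = 20. The sketch below is my own illustration of this arithmetic, not material from Wiliam's paper; the variable names and the simulation setup are assumptions chosen for clarity.

```python
# Minimal sketch (illustrative, not from the paper): how the familywise
# Type I error rate grows with the number of independent comparisons,
# each tested at alpha = 0.05, when the null hypothesis is true for all.
import numpy as np
from scipy import stats

alpha = 0.05
rng = np.random.default_rng(0)

for m in (1, 5, 20, 100):
    # Analytic familywise error rate for m independent tests:
    # P(at least one false positive) = 1 - (1 - alpha)^m
    analytic = 1 - (1 - alpha) ** m

    # Monte Carlo check: each trial draws m null samples (no true effect)
    # and counts the trial as a false positive if ANY t-test is "significant".
    trials = 2000
    false_positives = 0
    for _ in range(trials):
        data = rng.normal(size=(m, 30))              # 30 observations per test
        p = stats.ttest_1samp(data, 0.0, axis=1).pvalue
        false_positives += np.any(p < alpha)

    print(f"m={m:4d}  analytic={analytic:.3f}  simulated={false_positives / trials:.3f}")
```

The corrections proposed by the statisticians cited above attack exactly this inflation; Bonferroni's adjustment, for instance, tests each of the m comparisons at α/m so that the familywise rate stays at or below α.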
Journal description:
International, comparative and multidisciplinary in scope, Educational Research and Evaluation (ERE) publishes original, peer-reviewed academic articles dealing with research on issues of worldwide relevance in educational practice. The aim of the journal is to increase understanding of learning in pre-primary, primary, high school, college, university and adult education, and to contribute to the improvement of educational processes and outcomes. The journal seeks to promote cross-national and international comparative educational research by publishing findings relevant to the scholarly community, as well as to practitioners and others interested in education. The scope of the journal is deliberately broad in terms of both topics covered and disciplinary perspective.