Corrigendum: Simulation Studies as a Tool to Understand Bayes Factors
Pub Date: 2021-10-01 | DOI: 10.1177/25152459211061266
{"title":"Corrigendum: Simulation Studies as a Tool to Understand Bayes Factors","authors":"","doi":"10.1177/25152459211061266","DOIUrl":"https://doi.org/10.1177/25152459211061266","url":null,"abstract":"","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48159337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Causal Framework for Cross-Cultural Generalizability
Pub Date: 2021-09-23 | DOI: 10.1177/25152459221106366
Dominik Deffner, J. Rohrer, R. McElreath
Behavioral researchers increasingly recognize the need for more diverse samples that capture the breadth of human experience. Current attempts to establish generalizability across populations focus on threats to validity, constraints on generalization, and the accumulation of large, cross-cultural data sets. But for continued progress, we also require a framework that lets us determine which inferences can be drawn and how to make informative cross-cultural comparisons. We describe a generative causal-modeling framework and outline simple graphical criteria to derive analytic strategies and implied generalizations. Using both simulated and real data, we demonstrate how to project and compare estimates across populations and further show how to formally represent measurement equivalence or inequivalence across societies. We conclude with a discussion of how a formal framework for generalizability can assist researchers in designing more informative cross-cultural studies and thus provides a more solid foundation for cumulative and generalizable behavioral research.
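As a concrete, heavily simplified illustration of projecting an estimate from a study population to a target population, the R sketch below reweights hypothetical stratum-specific effect estimates by each population's composition. All numbers and stratum labels are invented for the example; this is not the authors' code or data.

```r
# Minimal sketch: projecting an effect estimate to a target population by
# reweighting stratum-specific estimates (poststratification-style).
# All values are hypothetical illustrations, not data from the article.

# Stratum-specific effect estimates obtained in the source sample
effect_by_stratum <- c(young_loweduc  = 0.40,
                       young_higheduc = 0.25,
                       old_loweduc    = 0.55,
                       old_higheduc   = 0.30)

# Stratum proportions in the source sample and in the target population
p_source <- c(0.35, 0.30, 0.20, 0.15)
p_target <- c(0.20, 0.20, 0.30, 0.30)

# Estimate as observed in the source sample
sum(effect_by_stratum * p_source)

# Estimate projected to the target population
sum(effect_by_stratum * p_target)
```

The two weighted averages diverge whenever effects are heterogeneous and the populations differ in composition, which is precisely the situation in which unadjusted cross-population comparisons mislead.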
{"title":"A Causal Framework for Cross-Cultural Generalizability","authors":"Dominik Deffner, J. Rohrer, R. Mcelreath","doi":"10.1177/25152459221106366","DOIUrl":"https://doi.org/10.1177/25152459221106366","url":null,"abstract":"Behavioral researchers increasingly recognize the need for more diverse samples that capture the breadth of human experience. Current attempts to establish generalizability across populations focus on threats to validity, constraints on generalization, and the accumulation of large, cross-cultural data sets. But for continued progress, we also require a framework that lets us determine which inferences can be drawn and how to make informative cross-cultural comparisons. We describe a generative causal-modeling framework and outline simple graphical criteria to derive analytic strategies and implied generalizations. Using both simulated and real data, we demonstrate how to project and compare estimates across populations and further show how to formally represent measurement equivalence or inequivalence across societies. We conclude with a discussion of how a formal framework for generalizability can assist researchers in designing more informative cross-cultural studies and thus provides a more solid foundation for cumulative and generalizable behavioral research.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":"5 1","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43042392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Guide to Visualizing Trajectories of Change With Confidence Bands and Raw Data
Pub Date: 2021-08-27 | DOI: 10.1177/25152459211047228
Andrea L. Howard
This tutorial is aimed at researchers working with repeated measures or longitudinal data who are interested in enhancing their visualizations of model-implied mean-level trajectories plotted over time with confidence bands and raw data. The intended audience is researchers who are already modeling their experimental, observational, or other repeated measures data over time using random-effects regression or latent curve modeling but who lack a comprehensive guide to visualize trajectories over time. This tutorial uses an example plotting trajectories from two groups, as seen in random-effects models that include Time × Group interactions and latent curve models that regress the latent time slope factor onto a grouping variable. This tutorial is also geared toward researchers who are satisfied with their current software environment for modeling repeated measures data but who want to make graphics using R software. Prior knowledge of R is not assumed, and readers can follow along using data and other supporting materials available via OSF at https://osf.io/78bk5/. Readers should come away from this tutorial with the tools needed to begin visualizing mean trajectories over time from their own models and enhancing those plots with graphical estimates of uncertainty and raw data that adhere to transparent practices in research reporting.
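For readers who want a quick sense of the kind of figure the tutorial builds, the sketch below fits a random-effects model to simulated two-group data and plots model-implied trajectories with approximate 95% confidence bands over the raw individual trajectories. The data, variable names, and model are invented for illustration; the tutorial's own materials are at the OSF link above.

```r
# Minimal sketch (not the tutorial's code): model-implied trajectories from a
# random-effects model, plotted with confidence bands over the raw data.
library(lme4)
library(ggplot2)

set.seed(1)
d <- expand.grid(id = 1:60, time = 0:4)
d$group <- ifelse(d$id <= 30, "Control", "Treatment")
d$y <- 2 + 0.3 * d$time + 0.4 * d$time * (d$group == "Treatment") +
  rnorm(60)[d$id] + rnorm(nrow(d), sd = 0.8)

fit <- lmer(y ~ time * group + (1 | id), data = d)

# Model-implied means and approximate 95% confidence bands
newd <- expand.grid(time = 0:4, group = c("Control", "Treatment"))
X <- model.matrix(~ time * group, newd)
newd$fit <- as.vector(X %*% fixef(fit))
se <- sqrt(diag(X %*% vcov(fit) %*% t(X)))
newd$lo <- newd$fit - 1.96 * se
newd$hi <- newd$fit + 1.96 * se

ggplot(newd, aes(time, fit, color = group, fill = group)) +
  geom_line(data = d, aes(time, y, group = id), alpha = 0.15, color = "grey50") +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.2, color = NA) +
  geom_line(linewidth = 1)
```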
{"title":"A Guide to Visualizing Trajectories of Change With Confidence Bands and Raw Data","authors":"Andrea L. Howard","doi":"10.1177/25152459211047228","DOIUrl":"https://doi.org/10.1177/25152459211047228","url":null,"abstract":"This tutorial is aimed at researchers working with repeated measures or longitudinal data who are interested in enhancing their visualizations of model-implied mean-level trajectories plotted over time with confidence bands and raw data. The intended audience is researchers who are already modeling their experimental, observational, or other repeated measures data over time using random-effects regression or latent curve modeling but who lack a comprehensive guide to visualize trajectories over time. This tutorial uses an example plotting trajectories from two groups, as seen in random-effects models that include Time × Group interactions and latent curve models that regress the latent time slope factor onto a grouping variable. This tutorial is also geared toward researchers who are satisfied with their current software environment for modeling repeated measures data but who want to make graphics using R software. Prior knowledge of R is not assumed, and readers can follow along using data and other supporting materials available via OSF at https://osf.io/78bk5/. Readers should come away from this tutorial with the tools needed to begin visualizing mean trajectories over time from their own models and enhancing those plots with graphical estimates of uncertainty and raw data that adhere to transparent practices in research reporting.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46770629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Conceptual Framework for Investigating and Mitigating Machine-Learning Measurement Bias (MLMB) in Psychological Assessment
Pub Date: 2021-07-30 | DOI: 10.1177/25152459211061337
L. Tay, S. E. Woo, L. Hickman, Brandon M. Booth, Sidney K. D’Mello
Given significant concerns about fairness and bias in the use of artificial intelligence (AI) and machine learning (ML) for psychological assessment, we provide a conceptual framework for investigating and mitigating machine-learning measurement bias (MLMB) from a psychometric perspective. MLMB is defined as differential functioning of the trained ML model between subgroups. MLMB manifests empirically when a trained ML model produces different predicted score levels for different subgroups (e.g., race, gender) despite them having the same ground-truth levels for the underlying construct of interest (e.g., personality) and/or when the model yields differential predictive accuracies across the subgroups. Because the development of ML models involves both data and algorithms, both biased data and algorithm-training bias are potential sources of MLMB. Data bias can occur in the form of nonequivalence between subgroups in the ground truth, platform-based construct, behavioral expression, and/or feature computing. Algorithm-training bias can occur when algorithms are developed with nonequivalence in the relation between extracted features and ground truth (i.e., algorithm features are differentially used, weighted, or transformed between subgroups). We explain how these potential sources of bias may manifest during ML model development and share initial ideas for mitigating them, including recognizing that new statistical and algorithmic procedures need to be developed. We also discuss how this framework clarifies MLMB but does not reduce the complexity of the issue.
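One empirical symptom of MLMB noted above, namely different predicted score levels across subgroups despite equal ground-truth levels, can be probed with a simple moderated regression of the model's predictions on ground truth and subgroup membership. The R sketch below uses simulated data and hypothetical variable names; it illustrates the idea only and is not a procedure taken from the article.

```r
# Minimal sketch: checking whether an ML model's predictions differ by subgroup
# after conditioning on the ground-truth score (simulated, hypothetical data).
set.seed(42)
n <- 1000
group <- factor(sample(c("A", "B"), n, replace = TRUE))
truth <- rnorm(n)                                                 # ground-truth construct scores
pred  <- 0.8 * truth + 0.3 * (group == "B") + rnorm(n, sd = 0.5)  # biased predictions

# Subgroup differences in intercept or slope indicate differential functioning
summary(lm(pred ~ truth * group))

# Differential predictive accuracy: mean squared error by subgroup
tapply((pred - truth)^2, group, mean)
```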
{"title":"A Conceptual Framework for Investigating and Mitigating Machine-Learning Measurement Bias (MLMB) in Psychological Assessment","authors":"L. Tay, S. E. Woo, L. Hickman, Brandon M. Booth, Sidney K. D’Mello","doi":"10.1177/25152459211061337","DOIUrl":"https://doi.org/10.1177/25152459211061337","url":null,"abstract":"Given significant concerns about fairness and bias in the use of artificial intelligence (AI) and machine learning (ML) for psychological assessment, we provide a conceptual framework for investigating and mitigating machine-learning measurement bias (MLMB) from a psychometric perspective. MLMB is defined as differential functioning of the trained ML model between subgroups. MLMB manifests empirically when a trained ML model produces different predicted score levels for different subgroups (e.g., race, gender) despite them having the same ground-truth levels for the underlying construct of interest (e.g., personality) and/or when the model yields differential predictive accuracies across the subgroups. Because the development of ML models involves both data and algorithms, both biased data and algorithm-training bias are potential sources of MLMB. Data bias can occur in the form of nonequivalence between subgroups in the ground truth, platform-based construct, behavioral expression, and/or feature computing. Algorithm-training bias can occur when algorithms are developed with nonequivalence in the relation between extracted features and ground truth (i.e., algorithm features are differentially used, weighted, or transformed between subgroups). We explain how these potential sources of bias may manifest during ML model development and share initial ideas for mitigating them, including recognizing that new statistical and algorithmic procedures need to be developed. We also discuss how this framework clarifies MLMB but does not reduce the complexity of the issue.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47805881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ManyClasses 1: Assessing the Generalizable Effect of Immediate Feedback Versus Delayed Feedback Across Many College Classes
Pub Date: 2021-07-01 | DOI: 10.1177/25152459211027575
Emily R. Fyfe, J. D. Leeuw, Paulo F. Carvalho, Robert L. Goldstone, Janelle Sherman, D. Admiraal, Laura K. Alford, Alison Bonner, C. Brassil, Christopher A. Brooks, Tracey Carbonetto, Sau Hou Chang, Laura Cruz, Melina T. Czymoniewicz-Klippel, F. Daniel, M. Driessen, Noel Habashy, Carrie Hanson-Bradley, E. Hirt, Virginia Hojas Carbonell, Daniel K. Jackson, Shay Jones, Jennifer L. Keagy, Brandi Keith, Sarah J. Malmquist, B. McQuarrie, K. Metzger, Maung Min, S. Patil, Ryan S. Patrick, Etienne Pelaprat, Maureen L. Petrunich-Rutherford, Meghan R. Porter, Kristina K. Prescott, Cathrine Reck, Terri Renner, E. Robbins, Adam R. Smith, P. Stuczynski, J. Thompson, N. Tsotakos, J. Turk, Kyle Unruh, Jennifer Webb, S. Whitehead, E. Wisniewski, Ke Anne Zhang, Benjamin A. Motz
Psychology researchers have long attempted to identify educational practices that improve student learning. However, experimental research on these practices is often conducted in laboratory contexts or in a single course, which threatens the external validity of the results. In this article, we establish an experimental paradigm for evaluating the benefits of recommended practices across a variety of authentic educational contexts—a model we call ManyClasses. The core feature is that researchers examine the same research question and measure the same experimental effect across many classes spanning a range of topics, institutions, teacher implementations, and student populations. We report the first ManyClasses study, in which we examined how the timing of feedback on class assignments, either immediate or delayed by a few days, affected subsequent performance on class assessments. Across 38 classes, the overall estimate for the effect of feedback timing was 0.002 (95% highest density interval = [−0.05, 0.05]), which indicates that there was no effect of immediate feedback compared with delayed feedback on student learning that generalizes across classes. Furthermore, there were no credibly nonzero effects for 40 preregistered moderators related to class-level and student-level characteristics. Yet our results provide hints that in certain kinds of classes, which were undersampled in the current study, there may be modest advantages for delayed feedback. More broadly, these findings provide insights regarding the feasibility of conducting within-class randomized experiments across a range of naturally occurring learning environments.
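As a rough sketch of the analytic idea (the same within-class randomized contrast estimated jointly across many classes), the code below fits a random-slope multilevel model to simulated data. The article reports a Bayesian analysis summarized with 95% highest density intervals; the frequentist lme4 stand-in here, and all variable names and numbers, are assumptions made for illustration.

```r
# Minimal sketch: feedback-timing contrast randomized within each of many
# classes, estimated with a random-slope multilevel model (simulated data;
# simplified frequentist analogue of the article's Bayesian analysis).
library(lme4)

set.seed(7)
n_class <- 38; n_per <- 40
d <- data.frame(class = rep(1:n_class, each = n_per))
d$immediate <- rbinom(nrow(d), 1, 0.5)            # randomized within class
baseline     <- rnorm(n_class, 0, 0.10)           # class-specific baselines
class_effect <- rnorm(n_class, 0, 0.08)           # class-specific timing effects
d$score <- 0.6 + baseline[d$class] + class_effect[d$class] * d$immediate +
  rnorm(nrow(d), sd = 0.15)

fit <- lmer(score ~ immediate + (1 + immediate | class), data = d)
summary(fit)                      # overall immediate-vs.-delayed feedback effect
confint(fit, method = "Wald")     # approximate interval for the fixed effects
```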
{"title":"ManyClasses 1: Assessing the Generalizable Effect of Immediate Feedback Versus Delayed Feedback Across Many College Classes","authors":"Emily R. Fyfe, J. D. Leeuw, Paulo F. Carvalho, Robert L. Goldstone, Janelle Sherman, D. Admiraal, Laura K. Alford, Alison Bonner, C. Brassil, Christopher A. Brooks, Tracey Carbonetto, Sau Hou Chang, Laura Cruz, Melina T. Czymoniewicz-Klippel, F. Daniel, M. Driessen, Noel Habashy, Carrie Hanson-Bradley, E. Hirt, Virginia Hojas Carbonell, Daniel K. Jackson, Shay Jones, Jennifer L. Keagy, Brandi Keith, Sarah J. Malmquist, B. McQuarrie, K. Metzger, Maung Min, S. Patil, Ryan S. Patrick, Etienne Pelaprat, Maureen L. Petrunich-Rutherford, Meghan R. Porter, Kristina K. Prescott, Cathrine Reck, Terri Renner, E. Robbins, Adam R. Smith, P. Stuczynski, J. Thompson, N. Tsotakos, J. Turk, Kyle Unruh, Jennifer Webb, S. Whitehead, E. Wisniewski, Ke Anne Zhang, Benjamin A. Motz","doi":"10.1177/25152459211027575","DOIUrl":"https://doi.org/10.1177/25152459211027575","url":null,"abstract":"Psychology researchers have long attempted to identify educational practices that improve student learning. However, experimental research on these practices is often conducted in laboratory contexts or in a single course, which threatens the external validity of the results. In this article, we establish an experimental paradigm for evaluating the benefits of recommended practices across a variety of authentic educational contexts—a model we call ManyClasses. The core feature is that researchers examine the same research question and measure the same experimental effect across many classes spanning a range of topics, institutions, teacher implementations, and student populations. We report the first ManyClasses study, in which we examined how the timing of feedback on class assignments, either immediate or delayed by a few days, affected subsequent performance on class assessments. Across 38 classes, the overall estimate for the effect of feedback timing was 0.002 (95% highest density interval = [−0.05, 0.05]), which indicates that there was no effect of immediate feedback compared with delayed feedback on student learning that generalizes across classes. Furthermore, there were no credibly nonzero effects for 40 preregistered moderators related to class-level and student-level characteristics. Yet our results provide hints that in certain kinds of classes, which were undersampled in the current study, there may be modest advantages for delayed feedback. More broadly, these findings provide insights regarding the feasibility of conducting within-class randomized experiments across a range of naturally occurring learning environments.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/25152459211027575","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45755906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Summary Plots With Adjusted Error Bars: The superb Framework With an Implementation in R
Pub Date: 2021-07-01 | DOI: 10.1177/25152459211035109
D. Cousineau, Marc-André Goulet, Bradley Harding
Plotting the data of an experiment allows researchers to illustrate the main results of a study, show effect sizes, compare conditions, and guide interpretations. To achieve all this, it is necessary to show point estimates of the results and their precision using error bars. Often, and potentially unbeknownst to them, researchers use a type of error bars—the confidence intervals—that convey limited information. For instance, confidence intervals do not allow comparing results (a) between groups, (b) between repeated measures, (c) when participants are sampled in clusters, and (d) when the population size is finite. The use of such stand-alone error bars can lead to discrepancies between the plot’s display and the conclusions derived from statistical tests. To overcome this problem, we propose to generalize the precision of the results (the confidence intervals) by adjusting them so that they take into account the experimental design and the sampling methodology. Unfortunately, most software dedicated to statistical analyses does not offer options to adjust error bars. As a solution, we developed an open-access, open-source library for R—superb—that allows users to create summary plots with easily adjusted error bars.
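To make the repeated-measures case concrete, the sketch below computes decorrelated, Morey-corrected (Cousineau-Morey-style) confidence intervals by hand for a small simulated within-subject design. It is a generic illustration of the kind of adjustment the superb package automates, not a demonstration of the package's own interface.

```r
# Minimal sketch: Cousineau-Morey-style decorrelated confidence intervals for a
# within-subject design (simulated data; not the superb package's own code).
set.seed(3)
n <- 30; k <- 4                              # 30 participants, 4 conditions
true_means <- c(10, 11, 11.5, 13)
subj <- rnorm(n, sd = 3)                     # shared subject effect -> correlated measures
y <- sapply(true_means, function(m) m + subj + rnorm(n, sd = 1.5))

# Subject-centering removes between-subject variability (decorrelation)
y_centered <- y - rowMeans(y) + mean(y)

# Morey (2008) correction for the number of conditions
se_adj  <- apply(y_centered, 2, sd) / sqrt(n) * sqrt(k / (k - 1))
ci_half <- qt(0.975, df = n - 1) * se_adj

data.frame(condition = 1:k,
           mean  = colMeans(y),
           lower = colMeans(y) - ci_half,
           upper = colMeans(y) + ci_half)
```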
{"title":"Summary Plots With Adjusted Error Bars: The superb Framework With an Implementation in R","authors":"D. Cousineau, Marc-André Goulet, Bradley Harding","doi":"10.1177/25152459211035109","DOIUrl":"https://doi.org/10.1177/25152459211035109","url":null,"abstract":"Plotting the data of an experiment allows researchers to illustrate the main results of a study, show effect sizes, compare conditions, and guide interpretations. To achieve all this, it is necessary to show point estimates of the results and their precision using error bars. Often, and potentially unbeknownst to them, researchers use a type of error bars—the confidence intervals—that convey limited information. For instance, confidence intervals do not allow comparing results (a) between groups, (b) between repeated measures, (c) when participants are sampled in clusters, and (d) when the population size is finite. The use of such stand-alone error bars can lead to discrepancies between the plot’s display and the conclusions derived from statistical tests. To overcome this problem, we propose to generalize the precision of the results (the confidence intervals) by adjusting them so that they take into account the experimental design and the sampling methodology. Unfortunately, most software dedicated to statistical analyses do not offer options to adjust error bars. As a solution, we developed an open-access, open-source library for R—superb—that allows users to create summary plots with easily adjusted error bars.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":"4 1","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42212519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Justify Your Alpha: A Primer on Two Practical Approaches
Pub Date: 2021-06-14 | DOI: 10.1177/25152459221080396
Maximilian Maier, D. Lakens
The default use of an alpha level of .05 is suboptimal for two reasons. First, decisions based on data can be made more efficiently by choosing an alpha level that minimizes the combined Type 1 and Type 2 error rate. Second, it is possible that in studies with very high statistical power, p values lower than the alpha level can be more likely when the null hypothesis is true than when the alternative hypothesis is true (i.e., Lindley’s paradox). In this article, we explain two approaches that can be used to justify a better choice of an alpha level than relying on the default threshold of .05. The first approach is based on the idea of either minimizing or balancing Type 1 and Type 2 error rates. The second approach lowers the alpha level as a function of the sample size to prevent Lindley’s paradox. An R package and Shiny app are provided to perform the required calculations. Both approaches have their limitations (e.g., the challenge of specifying relative costs and priors) but can offer an improvement to current practices, especially when sample sizes are large. The use of alpha levels that are better justified should improve statistical inferences and can increase the efficiency and informativeness of scientific research.
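A minimal version of the first approach, picking the alpha level that minimizes the combined Type 1 and Type 2 error rate for a planned design, can be sketched in base R as follows. The sample size, expected effect size, and equal weighting of the two error types are arbitrary assumptions for illustration; the authors' R package and Shiny app implement the full procedure.

```r
# Minimal sketch: find the alpha level that minimizes the (equally weighted)
# combined Type 1 and Type 2 error rate for a planned two-sample t test.
# Design assumptions (n per group, expected effect size) are illustrative only.
n <- 100          # participants per group
d <- 0.5          # expected standardized effect size

alphas <- seq(0.001, 0.2, by = 0.001)
beta <- sapply(alphas, function(a)
  1 - power.t.test(n = n, delta = d, sd = 1, sig.level = a,
                   type = "two.sample")$power)

combined <- (alphas + beta) / 2          # equal weights and equal prior odds
alphas[which.min(combined)]              # alpha minimizing the combined error rate
```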
{"title":"Justify Your Alpha: A Primer on Two Practical Approaches","authors":"Maximilian Maier, D. Lakens","doi":"10.1177/25152459221080396","DOIUrl":"https://doi.org/10.1177/25152459221080396","url":null,"abstract":"The default use of an alpha level of .05 is suboptimal for two reasons. First, decisions based on data can be made more efficiently by choosing an alpha level that minimizes the combined Type 1 and Type 2 error rate. Second, it is possible that in studies with very high statistical power, p values lower than the alpha level can be more likely when the null hypothesis is true than when the alternative hypothesis is true (i.e., Lindley’s paradox). In this article, we explain two approaches that can be used to justify a better choice of an alpha level than relying on the default threshold of .05. The first approach is based on the idea to either minimize or balance Type 1 and Type 2 error rates. The second approach lowers the alpha level as a function of the sample size to prevent Lindley’s paradox. An R package and Shiny app are provided to perform the required calculations. Both approaches have their limitations (e.g., the challenge of specifying relative costs and priors) but can offer an improvement to current practices, especially when sample sizes are large. The use of alpha levels that are better justified should improve statistical inferences and can increase the efficiency and informativeness of scientific research.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43639586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating Response Shift in Statistical Mediation Analysis
Pub Date: 2021-04-01 | DOI: 10.1177/25152459211012271
A. Georgeson, Matthew J. Valente, Oscar Gonzalez
Researchers and prevention scientists often develop interventions to target intermediate variables (known as mediators) that are thought to be related to an outcome. When researchers target a mediating construct measured by self-report, the meaning of the self-report measure could change from pretest to posttest for the individuals who received the intervention—which is a phenomenon referred to as response shift. As a result, any observed changes on the mediator measure across groups or across time might reflect a combination of true change on the construct and response shift. Although previous studies have focused on identifying the source and type of response shift in measures after an intervention, there has been limited research on how using sum scores in the presence of response shift affects the estimation of mediated effects via statistical mediation analysis, which is critical for explaining how the intervention worked. In this article, we focus on recalibration response shift, which is a change in internal standards of measurement and affects how respondents interpret the response scale. We provide background on the theory of response shift and the methodology used to detect response shift (i.e., tests of measurement invariance). In addition, we used simulated data sets to provide an illustration of how recalibration in the mediator can bias estimates of the mediated effect and affect Type I error and power.
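The invariance tests mentioned above are commonly run by comparing factor models with increasingly constrained parameters across groups or occasions. The lavaan sketch below, with simulated data and made-up item names, shows one generic way to do this; it is not the simulation code used in the article.

```r
# Minimal sketch: testing measurement invariance of a mediator measure across
# intervention groups with lavaan (simulated data, hypothetical item names).
library(lavaan)

set.seed(5)
pop <- ' mediator =~ 0.7*item1 + 0.7*item2 + 0.7*item3 + 0.7*item4 '
dat <- simulateData(pop, sample.nobs = 400)
dat$condition <- rep(c("treatment", "control"), each = 200)

model <- ' mediator =~ item1 + item2 + item3 + item4 '

configural <- cfa(model, data = dat, group = "condition")
metric     <- cfa(model, data = dat, group = "condition",
                  group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "condition",
                  group.equal = c("loadings", "intercepts"))

# A clear drop in fit when intercepts are constrained equal is one signal of
# recalibration-type response shift
anova(configural, metric, scalar)
```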
{"title":"Evaluating Response Shift in Statistical Mediation Analysis","authors":"A. Georgeson, Matthew J. Valente, Oscar Gonzalez","doi":"10.1177/25152459211012271","DOIUrl":"https://doi.org/10.1177/25152459211012271","url":null,"abstract":"Researchers and prevention scientists often develop interventions to target intermediate variables (known as mediators) that are thought to be related to an outcome. When researchers target a mediating construct measured by self-report, the meaning of the self-report measure could change from pretest to posttest for the individuals who received the intervention—which is a phenomenon referred to as response shift. As a result, any observed changes on the mediator measure across groups or across time might reflect a combination of true change on the construct and response shift. Although previous studies have focused on identifying the source and type of response shift in measures after an intervention, there has been limited research on how using sum scores in the presence of response shift affects the estimation of mediated effects via statistical mediation analysis, which is critical for explaining how the intervention worked. In this article, we focus on recalibration response shift, which is a change in internal standards of measurement and affects how respondents interpret the response scale. We provide background on the theory of response shift and the methodology used to detect response shift (i.e., tests of measurement invariance). In addition, we used simulated data sets to provide an illustration of how recalibration in the mediator can bias estimates of the mediated effect and affect Type I error and power.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/25152459211012271","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44476513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Transparency, Falsifiability, and Rigor by Making Hypothesis Tests Machine-Readable
Pub Date: 2021-04-01 | DOI: 10.1177/2515245920970949
D. Lakens, L. DeBruine
Making scientific information machine-readable greatly facilitates its reuse. Many scientific articles have the goal of testing a hypothesis, so making the tests of statistical predictions easier to find and access could be very beneficial. We propose an approach that can be used to make hypothesis tests machine-readable. We believe there are two benefits to specifying a hypothesis test in such a way that a computer can evaluate whether the statistical prediction is corroborated or not. First, hypothesis tests become more transparent, falsifiable, and rigorous. Second, scientists benefit if information related to hypothesis tests in scientific articles is easily findable and reusable, for example, to perform meta-analyses, conduct peer review, and examine metascientific research questions. We examine what a machine-readable hypothesis test should look like and demonstrate the feasibility of machine-readable hypothesis tests in a real-life example using the fully operational prototype R package scienceverse.
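As a generic illustration of the idea (not the interface of the scienceverse package itself), a hypothesis test can be stored as structured data, a test, a parameter, and a corroboration criterion, and then evaluated automatically:

```r
# Minimal sketch of a machine-readable hypothesis test: the prediction is stored
# as data and evaluated by code. Generic illustration only; this is not the
# scienceverse package's API.
hypothesis <- list(
  id        = "H1",
  test      = "t.test",
  formula   = y ~ group,
  parameter = "p.value",
  criterion = function(p) p < .05
)

set.seed(9)
dat <- data.frame(group = rep(c("a", "b"), each = 50),
                  y = c(rnorm(50, 0), rnorm(50, 0.6)))

result <- do.call(hypothesis$test, list(hypothesis$formula, data = dat))
corroborated <- hypothesis$criterion(result[[hypothesis$parameter]])
corroborated   # TRUE if the statistical prediction is corroborated
```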
{"title":"Improving Transparency, Falsifiability, and Rigor by Making Hypothesis Tests Machine-Readable","authors":"D. Lakens, L. DeBruine","doi":"10.1177/2515245920970949","DOIUrl":"https://doi.org/10.1177/2515245920970949","url":null,"abstract":"Making scientific information machine-readable greatly facilitates its reuse. Many scientific articles have the goal to test a hypothesis, so making the tests of statistical predictions easier to find and access could be very beneficial. We propose an approach that can be used to make hypothesis tests machine-readable. We believe there are two benefits to specifying a hypothesis test in such a way that a computer can evaluate whether the statistical prediction is corroborated or not. First, hypothesis tests become more transparent, falsifiable, and rigorous. Second, scientists benefit if information related to hypothesis tests in scientific articles is easily findable and reusable, for example, to perform meta-analyses, conduct peer review, and examine metascientific research questions. We examine what a machine-readable hypothesis test should look like and demonstrate the feasibility of machine-readable hypothesis tests in a real-life example using the fully operational prototype R package scienceverse.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/2515245920970949","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44442991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Psychologists Should Use Brunner-Munzel’s Instead of Mann-Whitney’s U Test as the Default Nonparametric Procedure
Pub Date: 2021-04-01 | DOI: 10.1177/2515245921999602
J. Karch
To investigate whether a variable tends to be larger in one population than in another, the t test is the standard procedure. In some situations, the parametric t test is inappropriate, and a nonparametric procedure should be used instead. The default nonparametric procedure is Mann-Whitney’s U test. Despite being a nonparametric test, Mann-Whitney’s test is associated with a strong assumption, known as exchangeability. I demonstrate that if exchangeability is violated, Mann-Whitney’s test can lead to wrong statistical inferences even for large samples. In addition, I argue that in psychology, exchangeability is typically not met. As a remedy, I introduce Brunner-Munzel’s test and demonstrate that it provides good Type I error rate control even if exchangeability is not met and that it has similar power as Mann-Whitney’s test. Consequently, I recommend using Brunner-Munzel’s test by default. To facilitate this, I provide advice on how to perform and report on Brunner-Munzel’s test.
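To show what switching defaults looks like in practice, the sketch below runs both tests on simulated groups with equal centers but unequal spreads, a case in which exchangeability is violated. It assumes the lawstat package's brunner.munzel.test(); the brunnermunzel package offers an alternative implementation. The data and settings are invented for illustration, not taken from the article.

```r
# Minimal sketch: Mann-Whitney's U test vs. the Brunner-Munzel test on two
# groups with unequal variances (simulated data). Assumes the lawstat package,
# which provides brunner.munzel.test().
library(lawstat)

set.seed(11)
x <- rnorm(40, mean = 0, sd = 1)     # group 1
y <- rnorm(40, mean = 0, sd = 3)     # group 2: same center, larger spread

wilcox.test(x, y)                    # Mann-Whitney's U test (assumes exchangeability)
brunner.munzel.test(x, y)            # Brunner-Munzel test (no exchangeability assumption)
```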
{"title":"Psychologists Should Use Brunner-Munzel’s Instead of Mann-Whitney’s U Test as the Default Nonparametric Procedure","authors":"J. Karch","doi":"10.1177/2515245921999602","DOIUrl":"https://doi.org/10.1177/2515245921999602","url":null,"abstract":"To investigate whether a variable tends to be larger in one population than in another, the t test is the standard procedure. In some situations, the parametric t test is inappropriate, and a nonparametric procedure should be used instead. The default nonparametric procedure is Mann-Whitney’s U test. Despite being a nonparametric test, Mann-Whitney’s test is associated with a strong assumption, known as exchangeability. I demonstrate that if exchangeability is violated, Mann-Whitney’s test can lead to wrong statistical inferences even for large samples. In addition, I argue that in psychology, exchangeability is typically not met. As a remedy, I introduce Brunner-Munzel’s test and demonstrate that it provides good Type I error rate control even if exchangeability is not met and that it has similar power as Mann-Whitney’s test. Consequently, I recommend using Brunner-Munzel’s test by default. To facilitate this, I provide advice on how to perform and report on Brunner-Munzel’s test.","PeriodicalId":55645,"journal":{"name":"Advances in Methods and Practices in Psychological Science","volume":" ","pages":""},"PeriodicalIF":13.6,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1177/2515245921999602","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44939061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}