Pub Date: 2026-01-23, DOI: 10.3758/s13428-025-02860-7
Erin M Buchanan, Mahmoud M Elsherif, Jason Geller, Chris L Aberson, Necdet Gurkan, Ettore Ambrosini, Tom Heyman, Maria Montefinese, Wolf Vanpaemel, Krystian Barzykowski, Carlota Batres, Katharina Fellnhofer, Guanxiong Huang, Joseph McFall, Gianni Ribeiro, Jan P Röer, José L Ulloa, Timo B Roettger, K D Valentine, Antonino Visalli, Kathleen Schmidt, Martin R Vasilev, Giada Viviani, Jacob F Miranda, Savannah C Lewis
The planning of sample size for research studies often focuses on obtaining a significant result given a specified level of power, significance, and an anticipated effect size. This planning requires prior knowledge of the study design and a statistical analysis to calculate the proposed sample size. However, there may not be one specific testable analysis from which to derive power (Silberzahn et al., Advances in Methods and Practices in Psychological Science, 1(3), 337-356, 2018) or a hypothesis to test for the project (e.g., creation of a stimuli database). Modern power and sample size planning suggestions include accuracy in parameter estimation (AIPE; Kelley, Behavior Research Methods, 39(4), 755-766, 2007; Maxwell et al., Annual Review of Psychology, 59, 537-563, 2008) and simulation of proposed analyses (Chalmers & Adkins, The Quantitative Methods for Psychology, 16(4), 248-280, 2020). These toolkits offer flexibility beyond traditional power analyses, which focus on an if-this, then-that approach. However, both AIPE and simulation require either a specific parameter (e.g., mean, effect size) or a statistical test for planning sample size. In this tutorial, we explore how AIPE and simulation approaches can be combined to accommodate studies that may not have a specific hypothesis test or that wish to account for a potential multiverse of analyses. Specifically, we focus on studies that use multiple items and suggest that sample sizes can be planned to measure those items adequately and precisely, regardless of the statistical test. This tutorial also provides multiple code vignettes and package functionality that researchers can adapt and apply to their own measures.
Title: Accuracy in parameter estimation and simulation approaches for sample-size planning accounting for item effects. Behavior Research Methods, 58(2), 48. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12830498/pdf/
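As a rough illustration of the approach this abstract describes (the article's own vignettes are in R and are not reproduced here), the following Python sketch combines an AIPE-style precision target with simulation: for each candidate sample size it checks what proportion of items would be estimated with a 95% CI half-width below a chosen cutoff. All item parameters and the precision target are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: combining AIPE-style precision targets with simulation
# to plan sample size for item-level estimates. Hypothetical parameters
# (item means/SDs, target CI half-width) are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

n_items = 30
item_means = rng.normal(4.0, 0.5, n_items)   # assumed item means on a 7-point scale
item_sds = rng.uniform(0.8, 1.5, n_items)    # assumed item standard deviations
target_halfwidth = 0.15                      # desired precision for each item mean
z = 1.96                                     # ~95% normal-theory interval

def prop_items_precise(n, n_sims=500):
    """Proportion of items whose 95% CI half-width meets the target, averaged over simulations."""
    hits = 0
    for _ in range(n_sims):
        data = rng.normal(item_means, item_sds, size=(n, n_items))
        halfwidth = z * data.std(axis=0, ddof=1) / np.sqrt(n)
        hits += np.mean(halfwidth <= target_halfwidth)
    return hits / n_sims

for n in (50, 100, 150, 200, 250):
    print(f"N = {n:3d}: {prop_items_precise(n):.2f} of items estimated precisely")
```

The smallest N at which the printed proportion reaches a preregistered threshold (e.g., 80% or 90% of items) would be the planned sample size under this precision criterion.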
Pub Date: 2026-01-23, DOI: 10.3758/s13428-025-02920-y
Theodoros A Kyriazos, Mary Poga
Analytical flexibility is an inherent feature of quantitative research that, when exercised without constraint, transparency, or strong theoretical justification, produces systematic bias and undermines inferential validity. This article presents a conceptual and computational framework identifying 10 particularly impactful and prevalent questionable research practices (QRPs) that exemplify how hidden flexibility distorts scientific conclusions across four stages of the research workflow. Rather than proposing a new taxonomy, we operationalize a targeted subset of QRPs into a conceptual framework that links each practice to its underlying bias mechanism. We further map these mechanisms to 10 evidence-based corrective strategies designed to mitigate the specific inferential violations each practice produces. To support education and diagnostic exploration, we present a reproducible R-based simulation suite that allows researchers to examine the impact of QRPs and prevention strategies across context-specific design parameters. This framework contributes to research integrity by offering a theory-based, stage-specific, and simulation-supported approach to identifying, understanding, and preventing the most consequential forms of hidden analytical flexibility in quantitative research.
Title: Ten particularly frequent and consequential questionable research practices in quantitative research: Bias mechanisms, preventive strategies, and a simulation-based framework. Behavior Research Methods, 58(2), 46.
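The article's simulation suite is R-based and not reproduced here; the Python sketch below illustrates, under illustrative assumptions about batch sizes and stopping rules, one bias mechanism such a suite can demonstrate: optional stopping (testing after every batch and stopping at the first p < .05) inflates the false-positive rate when the null is true.

```python
# Minimal sketch of a QRP simulation: optional stopping under a true null.
# Batch sizes, cap, and simulation count are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def false_positive_rate(n_start=20, n_step=10, n_max=100, n_sims=2000):
    fps = 0
    for _ in range(n_sims):
        a = list(rng.normal(0, 1, n_start))  # both groups drawn from the same population
        b = list(rng.normal(0, 1, n_start))
        n = n_start
        while True:
            p = stats.ttest_ind(a, b).pvalue
            if p < .05:                      # stop and "publish" at the first significant test
                fps += 1
                break
            if n >= n_max:                   # give up once the cap is reached
                break
            a.extend(rng.normal(0, 1, n_step))
            b.extend(rng.normal(0, 1, n_step))
            n += n_step
    return fps / n_sims

print(f"Nominal alpha = .05; simulated false-positive rate with optional stopping: "
      f"{false_positive_rate():.3f}")
```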
Pub Date: 2026-01-23, DOI: 10.3758/s13428-025-02940-8
Ariel Levy, Tali Kleiman, Yuval Hart
Categorization studies, in which stimuli vary along a category continuum, are becoming increasingly popular in psychological science. These studies demonstrate the effect of category ambiguity on various behavioral and neural measures. In such studies, researchers manipulate objective category levels by varying the physical properties of the stimuli, and then use these levels as predictors of behavior, assuming they map directly onto participants' perceived locations along the category continuum. This approach might not be optimal, considering the variability in participants' category boundary locations (their point of subjective equality, or PSE). In this tutorial, we propose addressing this issue by estimating participants' individual points of subjective equality, adjusting category levels relative to these points, and conducting statistical analyses on the subjective category levels. Implementing this method significantly improves the statistical power of the analysis in both experimental and simulated data. Adjusting stimulus levels by the points of subjective equality is highly suited for social categorization studies, in which points of subjective equality vary significantly. On a broader scale, it can be applied to a variety of categorization, discrimination, and decision-making studies.
Title: The point of subjective equality as a tool for accurate and robust analysis in categorization tasks. Behavior Research Methods, 58(2), 50.
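A minimal sketch of the tutorial's core move, assuming a logistic psychometric function and a single simulated participant: estimate the PSE from the choice proportions at each objective level, then re-center the levels on that estimate before analysis. The continuum, trial counts, and parameter values below are illustrative assumptions.

```python
# Estimate a participant's point of subjective equality (PSE) from a logistic fit,
# then re-express the objective category levels relative to that PSE.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)

def logistic(x, pse, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - pse)))

# Objective category levels (e.g., a morph continuum from category A to category B)
levels = np.linspace(0, 1, 9)
n_trials = 40                                     # trials per level
true_pse, true_slope = 0.62, 12.0                 # simulated participant
p_b = logistic(levels, true_pse, true_slope)
prop_b = rng.binomial(n_trials, p_b) / n_trials   # observed proportion of "B" responses

# Fit the psychometric function; the PSE is the level yielding 50% "B" responses
(pse_hat, slope_hat), _ = curve_fit(logistic, levels, prop_b, p0=[0.5, 10.0])
subjective_levels = levels - pse_hat              # levels re-centered on the participant's PSE

print(f"Estimated PSE = {pse_hat:.3f} (true {true_pse})")
print("Subjective category levels:", np.round(subjective_levels, 3))
```

The re-centered `subjective_levels` (an assumed variable name) would then replace the objective levels as the predictor in the subsequent statistical model.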
Pub Date: 2026-01-23, DOI: 10.3758/s13428-025-02916-8
Alon Zivony, Claudia C von Bastian, Rachel Pye
How quickly we attend to objects plays an important role in navigating the world, especially in dynamic and rapidly changing environments. Measuring individual differences in attention speed is therefore an important, yet challenging, task. Although reaction times in visual search tasks have often been used as an intuitive proxy of such individual differences, these measures are limited by inconsistent levels of reliability and contamination by non-attentional factors. This study introduces the rate of post-target distractor intrusions (DI) in the rapid serial visual presentation (RSVP) paradigm as an alternative method of studying individual differences in the speed of attention. In RSVP, a target is presented for a brief duration and embedded among multiple distractors. DIs are reports of a subsequent distractor rather than the target and have previously been shown to be associated with the speed of attention. The present study explored the reliability and validity of DI rates as a measure of individual differences. In three studies, DI rates showed high internal consistency and test-retest reliability over a year (>.90), even with a short task administration of only about 5 minutes. Moreover, DI rates were associated with measures related to attention speed, but not with unrelated measures of attentional control, reading speed, and attentional blink effects. Taken together, DI rates can serve as a useful tool for research into individual differences in the speed of attention. Links to a downloadable and easily executable DI experiment, as well as a brief discussion of methodological considerations, are provided to facilitate such future research.
Title: Measuring individual differences in the speed of attention using the distractor intrusion task. Behavior Research Methods, 58(2), 47. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12830438/pdf/
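As a hedged illustration of how DI rates and their reliability could be computed, the sketch below simulates binary per-trial intrusion data, takes each participant's DI rate, and estimates split-half reliability with the Spearman-Brown correction; the trial counts and intrusion probabilities are assumptions, not the study's data.

```python
# Per-participant distractor-intrusion (DI) rates and odd/even split-half reliability.
import numpy as np

rng = np.random.default_rng(11)

n_participants, n_trials = 60, 100
true_di_prob = rng.beta(2, 8, n_participants)             # stable person-level intrusion tendency
# 1 = report of the post-target distractor (a DI), 0 = any other report
trials = rng.binomial(1, true_di_prob[:, None], size=(n_participants, n_trials))

di_rate = trials.mean(axis=1)                             # per-participant DI rate
odd, even = trials[:, ::2].mean(axis=1), trials[:, 1::2].mean(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
spearman_brown = 2 * r_half / (1 + r_half)                # reliability of the full-length task

print(f"Mean DI rate = {di_rate.mean():.2f}, split-half reliability = {spearman_brown:.2f}")
```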
Pub Date: 2026-01-21, DOI: 10.3758/s13428-025-02932-8
Kevin Kapadia, Yunxiu Tang, Richard John
The Columbia Card Task (CCT) is a behavioral measure of risk-taking (BMRT), which has been cited over 1,500 times (Google Scholar, 3/1/2024). The original game had two versions (Hot and Cold), measuring affective and deliberative decision-making, respectively. Each version included 54 scored rounds where the loss cards were placed at the end, and nine unscored rounds where the loss cards were placed systematically among the gain cards. Over time, the CCT has gone through many iterations on critical components, such as the number of rounds, the position of the loss cards, and the introduction of a new version (Warm). Despite this, there are several issues with the CCT, notably a need for convergent validity with other measures of risk-taking. This paper reviews different iterations of the CCT, introduces a new (Toasty) version of the CCT that is a hybrid of the hot and warm versions, explores the consequences of randomly placing the loss cards among the gain cards consistent with instructions provided to participants, and examines the impact of incentivizing participants based on their score. Results (N = 405) show that the Toasty version behaves similarly to the Warm but provides additional insights into risk-taking behavior. When loss cards are placed randomly, participants are still sensitive to the game's parameters (gain amount, loss amount, and number of loss cards) and reveal the loss cards roughly half the time. Incentivizing participants in our study had little impact on the number of cards revealed.
Title: Jokers in the deck: A new temperature setting for the Columbia Card Task. Behavior Research Methods, 58(2), 45. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12823667/pdf/
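To make the abstract's "game parameters" concrete, the sketch below computes the expected value of turning over one more card given the gain amount, loss amount, and number of hidden loss cards. The 32-card board is the standard CCT layout, but the payoff values and the simplified "round ends on a loss card" scoring are illustrative assumptions, not the Toasty version's exact rules.

```python
# Expected value of revealing one more card in a CCT-style round.
def ev_next_card(cards_left, loss_cards_left, gain_amount, loss_amount):
    """Expected change in score from turning over one more card."""
    p_loss = loss_cards_left / cards_left
    return (1 - p_loss) * gain_amount - p_loss * loss_amount

# Example: 32-card board, 3 loss cards, +30 per gain card, -250 on a loss card
for revealed in range(0, 12):
    ev = ev_next_card(32 - revealed, 3, 30, 250)
    print(f"after {revealed:2d} safe reveals, EV of the next card = {ev:6.1f}")
```

Because the loss cards stay hidden while safe cards are removed, the probability of a loss rises with each reveal, which is what makes the number of cards turned a sensitive index of risk-taking.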
Pub Date: 2026-01-20, DOI: 10.3758/s13428-025-02919-5
Marco Badioli, Claudio Danti, Luigi Degni, Gianluca Finotti, Valentina Bernardi, Lorenzo Mattioni, Francesca Starita, Giuseppe di Pellegrino, Sara Giovagnoli, Mariagrazia Benassi, Sara Garofalo
In animal research, reward-predictive cues shape behavior through Pavlovian conditioning, yet animals vary in the value they assign to these cues. Sign-trackers (ST) attribute both incentive and predictive value to the cues, orienting their attention to them, while goal-trackers (GT) assign solely predictive value, orienting their attention rapidly toward the forthcoming reward. Although most animal studies report sign-tracking and goal-tracking as stable, trait-like behavioral profiles, human research has produced inconsistent results, raising questions about the reliability and stability of this behavior. To address these issues, we investigated, over a four-month period, the test-retest reliability and classification stability of the gaze index most frequently adopted in the human sign-tracking and goal-tracking literature. Our findings revealed good stability for sign-tracking behavior, but limited consistency for goal-tracking behavior. These results raise the possibility that goal-tracking may be either genuinely rare in the population or poorly captured by the current index. Overall, while the gaze index holds promise for identifying sign-tracking behavior, methodological refinements or alternative approaches may be needed to more reliably detect these behaviors in future research.
Title: Test-retest reliability of the gaze index for sign-tracking and goal-tracking. Behavior Research Methods, 58(2), 44.
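The sketch below shows one way a gaze-index test-retest analysis could be set up. The index used, (CS dwell - goal dwell) / (CS dwell + goal dwell), is a common sign-/goal-tracking formulation but is an assumption here, as are the simulated dwell times; it is not necessarily the exact index the article evaluates.

```python
# Test-retest correlation and classification stability for a gaze index.
import numpy as np

rng = np.random.default_rng(5)
n = 80

def gaze_index(cs_dwell, goal_dwell):
    return (cs_dwell - goal_dwell) / (cs_dwell + goal_dwell)   # +1 = pure sign-tracking

# Simulate correlated session-1 / session-2 dwell times four months apart
latent = rng.normal(0, 1, n)                                   # stable tracking tendency
cs_t1 = np.exp(latent + rng.normal(0, .4, n))
cs_t2 = np.exp(latent + rng.normal(0, .4, n))
goal_t1 = np.exp(-latent + rng.normal(0, .4, n))
goal_t2 = np.exp(-latent + rng.normal(0, .4, n))

gi_t1, gi_t2 = gaze_index(cs_t1, goal_t1), gaze_index(cs_t2, goal_t2)
retest_r = np.corrcoef(gi_t1, gi_t2)[0, 1]
stable = np.mean((gi_t1 > 0) == (gi_t2 > 0))   # same classification (sign- vs goal-tracker) at both sessions

print(f"Test-retest r = {retest_r:.2f}; consistent classification in {stable:.0%} of participants")
```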
Pub Date: 2026-01-20, DOI: 10.3758/s13428-025-02923-9
Jérémie Beucler, Zoe Purcell, Lucie Charles, Wim De Neys
Accurately quantifying belief strength in heuristics-and-biases tasks is crucial yet methodologically challenging. In this paper, we introduce an automated method leveraging large language models (LLMs) to systematically measure and manipulate belief strength. We specifically tested this method in the widely used "lawyer-engineer" base-rate neglect task, in which stereotypical descriptions (e.g., someone enjoying mathematical puzzles) conflict with normative base-rate information (e.g., engineers represent a very small percentage of the sample). Using this approach, we created an open-access database containing over 100,000 unique items systematically varying in stereotype-driven belief strength. Validation studies demonstrate that our LLM-derived belief strength measure correlates strongly with human typicality ratings and robustly predicts human choices in a base-rate neglect task. Additionally, our method revealed substantial and previously unnoticed variability in stereotype-driven belief strength in popular base-rate items from existing research, underlining the need to control for this in future studies. We further highlight methodological improvements achievable by refining the LLM prompt, as well as ways to enhance cross-cultural validity. The database presented here serves as a powerful resource for researchers, facilitating rigorous, replicable, and theoretically precise experimental designs, as well as enabling advancements in cognitive and computational modeling of reasoning. To support its use, we provide the R package baserater, which allows researchers to access the database to apply or adapt the method to their own research.
Title: Using large language models to estimate belief strength in reasoning. Behavior Research Methods, 58(2), 43.
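The sketch below mirrors the logic described in the abstract rather than the authors' baserater package: an LLM is asked how typical a description is of each group, and belief strength is derived from the contrast between the two ratings. The `rate_typicality()` helper is a hypothetical stand-in for a real LLM call, and the canned ratings are invented for illustration.

```python
# Deriving a stereotype-driven belief-strength score from typicality ratings.
def rate_typicality(description: str, group: str) -> float:
    """Hypothetical LLM query returning a 0-100 typicality rating of `description` for `group`."""
    # In practice this would prompt an LLM; fixed values here keep the sketch runnable.
    canned = {("enjoys solving mathematical puzzles", "engineer"): 88.0,
              ("enjoys solving mathematical puzzles", "lawyer"): 22.0}
    return canned[(description, group)]

def belief_strength(description: str, stereotyped: str, contrast: str) -> float:
    """Signed belief strength: positive = the description pulls toward the stereotyped group."""
    return rate_typicality(description, stereotyped) - rate_typicality(description, contrast)

desc = "enjoys solving mathematical puzzles"
print(f"Belief strength for '{desc}' (engineer vs. lawyer): "
      f"{belief_strength(desc, 'engineer', 'lawyer'):+.1f}")
```

In a base-rate neglect item, this score would index how strongly the description conflicts with (or supports) the base-rate information, which is the quantity the database varies systematically.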
Pub Date: 2026-01-20, DOI: 10.3758/s13428-025-02822-z
Franziska Henrich, Karl Christoph Klauer
Reaction time data in psychology are frequently censored or truncated. For example, two-alternative forced-choice tasks that are implemented with a response window or response deadline give rise to censored or truncated data. This must be accounted for in the data analysis, as important characteristics of the data, such as the mean, standard deviation, skewness, and correlations, can be strongly affected by censoring or truncation. In this paper, we use the probabilistic programming language Stan to analyze such data with Bayesian diffusion models. For this purpose, we added the functionality to model truncated and censored data with the diffusion model by adding the cumulative distribution function for reaction times generated from the diffusion model and its complement to the source code of Stan. We describe the usage of the truncated and censored models in Stan, test their performance in recovery and simulation-based calibration, and reanalyze existing datasets with the new method. The results of the recovery studies are satisfactory in terms of correlations (r = .93-1.00), coverage (93-95% of true values lie in the 95% highest density interval), and bias. Simulation-based calibration studies suggest that the new functionality is implemented without errors. The reanalysis of existing datasets further validates the new method.
Title: Modeling truncated and censored data with the diffusion model in Stan. Behavior Research Methods, 58(2), 42. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12819533/pdf/
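In outline, and with notation chosen here for illustration rather than taken from the paper, the likelihood contributions that the added CDF makes possible look as follows, where f_k(t) is the Wiener first-passage time density at boundary k, F_k(t) its CDF, and d the response deadline:

```latex
% Sketch of deadline-d likelihood contributions for a two-boundary diffusion model
% (illustrative notation, not necessarily the paper's).
\begin{align}
  \text{observed response at boundary } k \text{ at time } t \le d &:\quad f_k(t) \\
  \text{censored trial (no response by } d\text{)} &:\quad 1 - F_{\mathrm{upper}}(d) - F_{\mathrm{lower}}(d) \\
  \text{truncated data (responses after } d \text{ discarded)} &:\quad
      \frac{f_k(t)}{F_{\mathrm{upper}}(d) + F_{\mathrm{lower}}(d)}, \quad t \le d
\end{align}
```

Censoring keeps non-response trials in the likelihood via their survival probability, whereas truncation renormalizes the observed density by the probability of responding before the deadline.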
Pub Date: 2026-01-16, DOI: 10.3758/s13428-025-02926-6
Thomas Schmidt, Maximilian P Wolkersdorfer, Xin Ying Lee, Omar Jubran
One of the most popular approaches to unconscious cognition is the technique of "post hoc selection": Priming effects and visibility ratings are measured in multiple tasks on the same trial, and only trials with the lowest visibility ratings are selected for analysis of (presumably unconscious) priming effects. In the past, the technique has been criticized for creating statistical artifacts and capitalizing on chance. Here, we argue that post hoc selection constitutes a sampling fallacy, confusing sensitivity and response bias, wrongly ascribing unconscious processing to stimulus conditions that may be far from indiscriminable. In response to a high-profile "best practice" paper by Stockart et al. (2025) that condones the technique, we use standard signal detection theory to show that post hoc selection only isolates trials with neutral response bias, irrespective of actual sensitivity, and thus fails to isolate trials where the critical stimulus is "unconscious". Our own data demonstrate that zero-visibility ratings are consistent with uncomfortably high levels of sensitivity. As an alternative to post hoc selection, we advocate the study of functional dissociations, where direct (D) and indirect (I) measures are conceptualized as spanning a two-dimensional D-I space wherein simple, sensitivity, and double dissociations appear as distinct curve patterns. While Stockart et al.'s recommendations cover only a single line of that space where D is close to zero, functional dissociations can utilize the entire space. This circumvents requirements like null visibility and exhaustive reliability, allows for dissociations among different measures of awareness, and supports the planful measurement of functional relationships between direct and indirect measures.
Title: Unconscious cognition without post hoc selection artifacts: From selective analysis to functional dissociations. Behavior Research Methods, 58(2), 39. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12811327/pdf/
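A minimal signal detection sketch of the paper's central point, with illustrative d' and criterion values: trials given the lowest visibility rating are carved out by where the rating criteria sit (a matter of response bias), so discrimination of the critical stimulus can remain well above chance within them.

```python
# Residual sensitivity among trials rated "not seen" under a conservative rating criterion.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)

n_trials, d_prime, rating_criterion = 100_000, 1.2, 1.0
prime = rng.choice([-1, 1], n_trials)                   # two prime identities
evidence = prime * d_prime / 2 + rng.normal(0, 1, n_trials)

zero_visibility = np.abs(evidence) < rating_criterion   # conservative "I saw nothing" region
identity_report = np.sign(evidence)                     # forced-choice identity decision

sel = zero_visibility
hit = np.mean(identity_report[sel & (prime == 1)] == 1)
fa = np.mean(identity_report[sel & (prime == -1)] == 1)
residual_dprime = norm.ppf(hit) - norm.ppf(fa)

print(f"{sel.mean():.0%} of trials rated 'not seen'; "
      f"d' within those trials = {residual_dprime:.2f} (generating d' = {d_prime})")
```

Because the selected trials still carry diagnostic evidence, the residual d' stays clearly above zero even though every retained trial was rated as completely unseen.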
Pub Date: 2026-01-16, DOI: 10.3758/s13428-025-02922-w
Scott Crossley, Joon Suh Choi, Kenny Tang, Laurie Cutting
This study documents and assesses the Tool for Automatic Analysis of Decoding Ambiguity (TAADA). TAADA calculates measures related to decoding, including metrics for grapheme and phoneme counts, neighborhood effects, rhymes, and conditional probabilities for sound-spelling relationships. These measures are assessed in two reading studies. The first study examined links between decoding variables and judgments of reading ease in a corpus of ~5000 reading excerpts, finding that variables related to word frequency, phonographic neighbors for words, word syllable length, and the reverse prior probability for consonants explained 34% of the variance in the reading scores. The second examined links between decoding variables and student reading miscues, finding that word frequency, phoneme counts, rhyme counts, and probability counts explained 3% of the variance in students' reading miscues.
Title: The Tool for Automatic Analysis of Decoding Ambiguity (TAADA). Behavior Research Methods, 58(2), 40. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12811272/pdf/
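As a hedged illustration of the kinds of measures TAADA computes, the sketch below derives grapheme counts, phoneme counts, and a P(phoneme | grapheme) consistency measure from a tiny hand-coded lexicon; the segmentations, phoneme codes, and resulting probabilities are illustrative assumptions, not TAADA's actual databases or output.

```python
# Toy decoding measures: counts and a sound-spelling conditional probability.
from collections import Counter, defaultdict

# word -> (grapheme segmentation, phoneme segmentation), aligned one-to-one
mini_lexicon = {
    "cat":  (["c", "a", "t"],       ["K", "AE", "T"]),
    "city": (["c", "i", "t", "y"],  ["S", "IH", "T", "IY"]),
    "cent": (["c", "e", "n", "t"],  ["S", "EH", "N", "T"]),
    "cot":  (["c", "o", "t"],       ["K", "AA", "T"]),
}

# Count grapheme-phoneme pairings to estimate P(phoneme | grapheme)
pair_counts = defaultdict(Counter)
for graphemes, phonemes in mini_lexicon.values():
    for g, p in zip(graphemes, phonemes):
        pair_counts[g][p] += 1

def p_phoneme_given_grapheme(g, p):
    total = sum(pair_counts[g].values())
    return pair_counts[g][p] / total if total else 0.0

for word, (graphemes, phonemes) in mini_lexicon.items():
    probs = [p_phoneme_given_grapheme(g, p) for g, p in zip(graphemes, phonemes)]
    print(f"{word:5s} graphemes={len(graphemes)} phonemes={len(phonemes)} "
          f"mean P(phoneme|grapheme)={sum(probs)/len(probs):.2f}")
```

Words whose spellings map onto their sounds with low conditional probability (e.g., the ambiguous "c") are the decoding-ambiguous items that such measures are designed to flag.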