Discriminant Validity of Interval Response Formats: Investigating the Dimensional Structure of Interval Widths.
Pub Date: 2024-11-25 | DOI: 10.1177/00131644241283400
Matthias Kloft, Daniel W Heck
In psychological research, respondents are usually asked to answer questions with a single response value. A useful alternative is interval response formats such as the dual-range slider (DRS), in which respondents provide an interval with a lower and an upper bound for each item. Interval responses may be used to measure psychological constructs such as variability in the domain of personality (e.g., self-ratings), uncertainty in estimation tasks (e.g., forecasting), and ambiguity in judgments (e.g., concerning the pragmatic use of verbal quantifiers). However, it is unclear whether respondents are sensitive to the requirements of a particular task and whether interval widths actually measure the constructs of interest. To test the discriminant validity of interval widths, we conducted a study in which respondents answered 92 items belonging to seven different tasks from the domains of personality, estimation, and judgment. We investigated the dimensional structure of interval widths by fitting exploratory and confirmatory factor models while using an appropriate multivariate logit function to transform the bounded interval responses. The estimated factorial structure closely followed the theoretically assumed structure of the tasks, which varied in their degree of similarity. We did not find a strong overarching general factor, which speaks against a response style influencing interval widths across all tasks and domains. Overall, this indicates that respondents are sensitive to the requirements of different tasks and domains when using interval response formats.
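To make the transformation step concrete, here is a minimal sketch of a multivariate logit-type mapping for bounded interval responses, assuming the DRS bounds are rescaled to [0, 1]; the function name and the exact log-ratio parameterization are illustrative and not necessarily the authors' implementation.

```python
import numpy as np

def interval_to_logits(lower, upper, eps=1e-6):
    """Map a bounded interval response (lower, upper) on [0, 1] to two
    unbounded coordinates via log-ratios of the three segments
    (below, width, above). A sketch of one possible multivariate
    logit-type transform, not necessarily the paper's parameterization."""
    lower = np.clip(lower, eps, 1 - eps)
    upper = np.clip(upper, eps, 1 - eps)
    below = lower                                 # mass below the interval
    width = np.maximum(upper - lower, eps)        # interval width
    above = np.maximum(1 - upper, eps)            # mass above the interval
    y_location = np.log(below / above)            # where the interval sits
    y_width = np.log(width / np.sqrt(below * above))  # relative (log) width
    return y_location, y_width

# Example: a respondent marks the interval [0.4, 0.7] on a 0-1 slider
print(interval_to_logits(0.4, 0.7))
```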
Novick Meets Bayes: Improving the Assessment of Individual Students in Educational Practice and Research by Capitalizing on Assessors' Prior Beliefs.
Pub Date: 2024-11-25 | DOI: 10.1177/00131644241296139
Steffen Zitzmann, Gabe A Orona, Julian F Lohmann, Christoph König, Lisa Bardach, Martin Hecht
The assessment of individual students is not only crucial in the school setting but also at the core of educational research. Although classical test theory focuses on maximizing insights from student responses, the Bayesian perspective incorporates the assessor's prior belief, thereby enriching assessment with knowledge gained from previous interactions with the student or with similar students. We propose and illustrate a formal Bayesian approach that not only allows assessors to form a stronger belief about a student's competency but also offers a more accurate assessment than classical test theory. In addition, we propose a straightforward method for gauging prior beliefs using two specific items and point to the possibility of integrating additional information.
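The core idea of combining a prior belief with an observed score can be sketched with standard normal-normal updating; the numbers, and the assumption that the prior and the test score are on the same scale with known standard deviations, are illustrative rather than the authors' specific procedure.

```python
# Minimal sketch of normal-normal Bayesian updating of a competency estimate,
# assuming the assessor's prior belief and the observed test score are on the
# same scale and that both standard deviations are known (illustrative only).
def posterior_competency(prior_mean, prior_sd, observed_score, error_sd):
    prior_prec = 1.0 / prior_sd ** 2          # precision of the prior belief
    data_prec = 1.0 / error_sd ** 2           # precision of the observed score
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * observed_score)
    return post_mean, post_var ** 0.5

# Prior belief: about 60 points (SD 8); observed score: 75 with an SEM of 10
print(posterior_competency(60, 8, 75, 10))
```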
Differential Item Functioning Effect Size Use for Validity Information.
Pub Date: 2024-11-22 | DOI: 10.1177/00131644241293694
W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch
There has been an emphasis on effect sizes for differential item functioning (DIF) with the purpose of understanding the magnitude of the differences detected through statistical significance testing. Several different effect sizes have been suggested that correspond to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the DIF effect size measures described for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked the following two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that had support for performing well across several simulated conditions. We also apply these effect sizes to a real data set to provide an example. Results of the study revealed that the log odds ratio of fixed effects (Ln ŌR_FE) and the variance of the Mantel-Haenszel log odds ratio (τ̂²) were most accurate for identifying which test contains more DIF. We point to future directions with this work to aid the continued focus on effect sizes to understand DIF magnitude.
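As an illustration of how such aggregate effect sizes can be computed, the following sketch pools item-level Mantel-Haenszel log odds ratios; it assumes the item log odds ratios and their sampling variances are already estimated, and it uses a DerSimonian-Laird-type variance estimate as a stand-in for the τ̂² described in the abstract.

```python
import numpy as np

def aggregate_dif_effect_sizes(log_or, var):
    """Pool item-level MH log odds ratios into (a) a fixed-effects weighted
    mean (Ln OR_FE) and (b) a between-item variance estimate (tau^2).
    DerSimonian-Laird-type sketch; inputs are assumed to be pre-estimated."""
    log_or, var = np.asarray(log_or, float), np.asarray(var, float)
    w = 1.0 / var                                  # fixed-effects weights
    ln_or_fe = np.sum(w * log_or) / np.sum(w)      # weighted mean log OR
    q = np.sum(w * (log_or - ln_or_fe) ** 2)       # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(log_or) - 1)) / c)   # truncated at zero
    return ln_or_fe, tau2

# Hypothetical log odds ratios and sampling variances for five items
print(aggregate_dif_effect_sizes([0.1, -0.3, 0.5, 0.0, 0.2],
                                 [0.04, 0.05, 0.03, 0.06, 0.04]))
```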
Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs.
Pub Date: 2024-11-08 | DOI: 10.1177/00131644241290172
Xinran Liu, Daniel McNeish
Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method, dynamic fit index (DFI) cutoffs, to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications significantly reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.
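The stability question can be sketched generically: derive a cutoff as a percentile of fit-index values across Monte Carlo replications and see how it varies with the number of replications. In the sketch below, simulate_fit_index() is a hypothetical stand-in for one replication of the DFI procedure; the placeholder distribution and the 95th percentile are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_fit_index():
    # Hypothetical stand-in for one replication of the DFI procedure
    # (generate data, fit the model, return a fit index such as RMSEA).
    return rng.gamma(shape=2.0, scale=0.015)

def percentile_cutoff(n_reps, percentile=95):
    draws = np.array([simulate_fit_index() for _ in range(n_reps)])
    return np.percentile(draws, percentile)   # cutoff = upper percentile

# Compare how the derived cutoff varies with the number of replications
for n_reps in (100, 200, 500):
    print(n_reps, round(percentile_cutoff(n_reps), 4))
```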
Invariance: What Does Measurement Invariance Allow Us to Claim?
Pub Date: 2024-10-28 | DOI: 10.1177/00131644241282982
John Protzko
Measurement involves numerous theoretical and empirical steps; ensuring our measures operate the same way in different groups is one of them. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly assumed to mean the scale is measuring the same thing in those groups. Here we test the assumption that measurement invariance implies common measurement by randomly assigning American adults (N = 1,500) to fill out scales assessing either a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find that a nonsense scale with items measuring nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing; it may be a necessary but not a sufficient condition.
Detecting Differential Item Functioning Using Response Time.
Pub Date: 2024-10-26 | DOI: 10.1177/00131644241280400
Qizhou Duan, Ying Cheng
This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both working speed and group membership as independent variables and log-transformed response times as the dependent variable. Effect size measures such as ΔR² and the percentage change in regression coefficients, in conjunction with statistical significance tests, were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with ΔR², and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using ΔR² with the significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.
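A minimal sketch of this regression-based check for a single item is shown below: regress log response time on working speed alone (reduced model) and on speed plus group membership (full model), and take ΔR² as the effect size. The simulated data and coefficient values are illustrative, not from the study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Illustrative data for one item: log RT depends on working speed, plus a
# small group effect that represents uniform DIF in response time.
n = 1000
speed = rng.normal(size=n)                  # working-speed estimates
group = rng.integers(0, 2, size=n)          # 0 = reference, 1 = focal group
log_rt = 2.0 - 0.5 * speed + 0.15 * group + rng.normal(scale=0.4, size=n)

reduced = sm.OLS(log_rt, sm.add_constant(np.column_stack([speed]))).fit()
full = sm.OLS(log_rt, sm.add_constant(np.column_stack([speed, group]))).fit()

delta_r2 = full.rsquared - reduced.rsquared      # DIF effect size for the item
print(round(delta_r2, 4), round(full.params[-1], 3))
```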
Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.
Pub Date: 2024-10-07 | DOI: 10.1177/00131644241271309
Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl
The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (N = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.
On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings.
Pub Date: 2024-10-07 | DOI: 10.1177/00131644241281049
Tenko Raykov, Khaled Alkherainej
A multiple-step procedure is outlined that can be used for examining the latent structure of behavior measurement instruments in complex empirical settings. The method permits one to study their latent structure after assessing the need to account for clustering effects and the need to examine that structure within individual levels of fixed factors, such as gender or substantively relevant group membership. The approach is readily applicable with binary or binary-scored items using popular and widely available software. The described procedure is illustrated with empirical data from a student behavior screening instrument.
Interpretation of the Standardized Mean Difference Effect Size When Distributions Are Not Normal or Homoscedastic.
Pub Date: 2024-10-06 | DOI: 10.1177/00131644241278928
Larry V Hedges
The standardized mean difference (sometimes called Cohen's d) is an effect size measure widely used to describe the outcomes of experiments. It is a mathematically natural way to describe differences between groups of data that are normally distributed with different means but the same standard deviation. In that context, it can be interpreted as determining several indexes of overlap between the two distributions. If the data are not approximately normally distributed or if they have substantially unequal standard deviations, the relation between d and the overlap between distributions can be very different, and interpretations of d that apply when the data are normal with equal variances are unreliable.
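Under the normal, equal-variance case referred to above, d maps onto several standard overlap indexes. The sketch below computes three of them (Cohen's U3, the overlapping coefficient, and the probability of superiority) from their textbook normal-theory formulas, which are the assumption being illustrated rather than the paper's own derivation.

```python
import numpy as np
from scipy.stats import norm

def overlap_indexes(d):
    """Standard normal-theory overlap indexes implied by a given d,
    assuming both groups are normal with equal standard deviations."""
    u3 = norm.cdf(d)                    # share of group 2 below group 1's mean
    ovl = 2 * norm.cdf(-abs(d) / 2)     # overlapping coefficient of the densities
    ps = norm.cdf(d / np.sqrt(2))       # P(random score from group 1 > group 2)
    return u3, ovl, ps

print([round(v, 3) for v in overlap_indexes(0.5)])
```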
Enhancing Effort-Moderated Item Response Theory Models by Evaluating a Two-Step Estimation Method and Multidimensional Variations on the Model.
Pub Date: 2024-10-06 | DOI: 10.1177/00131644241280727
Bowen Wang, Corinne Huggins-Manley, Huan Kuang, Jiawei Xiong
Rapid-guessing behavior in data can compromise our ability to estimate item and person parameters accurately. Consequently, it is crucial to model data with rapid-guessing patterns in a way that can produce unbiased ability estimates. This study proposes and evaluates three alternative modeling approaches that follow the logic of the effort-moderated item response theory (EM-IRT) model to analyze response data with rapid-guessing responses. One is the two-step EM-IRT model, which uses item parameters estimated from respondents without rapid-guessing behavior and was initially proposed by Rios and Soland without further investigation. The other two models are effort-moderated multidimensional models (EM-MIRT), which we introduce in this study in both between-item and within-item variants. The advantage of the EM-MIRT model is that it accounts for the underlying relationship between rapid-guessing propensity and ability. The three models were compared with the traditional EM-IRT model regarding the accuracy of parameter recovery in various simulated conditions. Results demonstrated that the two-step EM-IRT and between-item EM-MIRT models consistently outperformed the traditional EM-IRT model under various conditions, with the two-step EM-IRT estimation generally delivering the best performance, especially for ability and item difficulty parameter estimation. In addition, different rapid-guessing patterns (i.e., difficulty-based, changing state, and decreasing effort) did not affect the performance of the two-step EM-IRT model. Overall, the findings suggest that the EM-IRT model with the two-step parameter estimation method can be applied in practice for estimating ability in the presence of rapid-guessing responses due to its accuracy and efficiency. The between-item EM-MIRT model can be used as an alternative model when there is no significant mean difference in ability estimates between examinees who exhibit rapid-guessing behavior and those who do not.
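The effort-moderated logic that these models build on can be sketched as follows: when a response is classified as effortful (typically via a response-time threshold), it follows a standard IRT model; otherwise it is treated as a guess with a constant probability. The 2PL form and parameter values below are illustrative, not the specific models evaluated in the study.

```python
import numpy as np

def em_irt_prob(theta, a, b, effortful, guess_prob):
    """Effort-moderated response probability: 2PL when the response is
    flagged as effortful, a constant guessing probability otherwise."""
    p_2pl = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # 2PL success probability
    return np.where(effortful, p_2pl, guess_prob)

theta = np.array([0.0, 0.0])              # two examinees of equal ability
effortful = np.array([True, False])       # flagged via a response-time threshold
print(em_irt_prob(theta, a=1.2, b=-0.3, effortful=effortful, guess_prob=0.25))
```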