Educational and Psychological Measurement: Latest Articles (Page 3)

Detecting Differential Item Functioning Using Response Time.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-26 DOI: 10.1177/00131644241280400

Qizhou Duan, Ying Cheng

This study investigated uniform differential item functioning (DIF) detection in response times. We proposed a regression analysis approach with both the working speed and the group membership as independent variables, and logarithm-transformed response times as the dependent variable. Effect size measures such as $\Delta R^{2}$ and percentage change in regression coefficients in conjunction with the statistical significance tests were used to flag DIF items. A simulation study was conducted to assess the performance of three DIF detection criteria: (a) significance test, (b) significance test with $\Delta R^{2}$, and (c) significance test with the percentage change in regression coefficients. The simulation study considered factors such as sample sizes, proportion of the focal group in relation to total sample size, number of DIF items, and the amount of DIF. The results showed that the significance test alone was too strict; using the percentage change in regression coefficients as an effect size measure reduced the flagging rate when the sample size was large, but the effect was inconsistent across different conditions; using $\Delta R^{2}$ with the significance test reduced the flagging rate and was fairly consistent. The PISA 2018 data were used to illustrate the performance of the proposed method in a real dataset. Furthermore, we provide guidelines for conducting DIF studies with response time.
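
The regression criterion described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes a working-speed estimate is already available for each examinee, takes $\Delta R^{2}$ to be the gain in $R^{2}$ when group membership is added to a model that already contains working speed, and uses simulated data with a hypothetical 0.3 log-second group effect. The helper names `ols_r2` and `delta_r2` are invented for this sketch.

```python
import random


def ols_r2(columns, y):
    """Fit y on an intercept plus the given predictor columns; return (R^2, coefficients)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    y_bar = sum(y) / n
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    ss_res = sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
                 for i in range(n))
    return 1.0 - ss_res / ss_tot, beta


def delta_r2(log_rt, speed, group):
    """Gain in R^2 from adding group membership, plus the group coefficient."""
    r2_reduced, _ = ols_r2([speed], log_rt)
    r2_full, beta = ols_r2([speed, group], log_rt)
    return r2_full - r2_reduced, beta[2]


# Simulated data for one item (all values are hypothetical).
random.seed(1)
n = 400
speed = [random.gauss(0, 1) for _ in range(n)]
group = [i % 2 for i in range(n)]  # 0 = reference group, 1 = focal group
log_rt = [4.0 - 0.5 * s + 0.3 * g + random.gauss(0, 0.4)
          for s, g in zip(speed, group)]

gain, group_coef = delta_r2(log_rt, speed, group)
print(f"delta R^2 = {gain:.3f}, group coefficient = {group_coef:.3f}")
```

In practice the group coefficient would also be tested for significance, and an item would be flagged only when both the test and the effect size threshold agree; the threshold itself is not given in the abstract and is left out here.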

Citations: 0

Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-07 DOI: 10.1177/00131644241271309

Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl

The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (N = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.

Citations: 0

On Latent Structure Examination of Behavioral Measuring Instruments in Complex Empirical Settings.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-07 DOI: 10.1177/00131644241281049

Tenko Raykov, Khaled Alkherainej

A multiple-step procedure is outlined that can be used for examining the latent structure of behavior measurement instruments in complex empirical settings. The method permits one to study their latent structure after assessing the need to account for clustering effects and the necessity of its examination within individual levels of fixed factors, such as gender or group membership of substantive relevance. The approach is readily applicable with binary or binary-scored items using popular and widely available software. The described procedure is illustrated with empirical data from a student behavior screening instrument.

Citations: 0

Interpretation of the Standardized Mean Difference Effect Size When Distributions Are Not Normal or Homoscedastic.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-06 DOI: 10.1177/00131644241278928

Larry V Hedges

The standardized mean difference (sometimes called Cohen's d) is an effect size measure widely used to describe the outcomes of experiments. It is mathematically natural to describe differences between groups of data that are normally distributed with different means but the same standard deviation. In that context, it can be interpreted as determining several indexes of overlap between the two distributions. If the data are not approximately normally distributed or if they have substantially unequal standard deviations, the relation between d and overlap between distributions can be very different, and interpretations of d that apply when the data are normal with equal variances are unreliable.
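
The link between d and distributional overlap can be made concrete with the standard normal-theory formulas. The sketch below is an illustration under stated assumptions rather than material from the article: it uses the textbook expressions for Cohen's U3, the overlapping coefficient (OVL), and the probability of superiority (CL), which hold for two normal distributions with a common standard deviation. The heteroscedastic comparison uses hypothetical standard deviations of 1 and 3 and defines d with the root-mean-square of the two standard deviations.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def overlap_indexes(d):
    """Normal-theory overlap indexes implied by a standardized mean difference d."""
    return {
        "U3": STD_NORMAL.cdf(d),                 # share of group 2 above the group 1 mean
        "OVL": 2 * STD_NORMAL.cdf(-abs(d) / 2),  # overlapping coefficient
        "CL": STD_NORMAL.cdf(d / sqrt(2)),       # P(random group 2 score > group 1 score)
    }


d = 0.5
idx = overlap_indexes(d)

# Equal variances: the formula agrees with the directly computed overlap.
ovl_equal = NormalDist(0, 1).overlap(NormalDist(d, 1))

# Unequal variances (hypothetical SDs 1 and 3), same d = 0.5:
# the actual overlap is no longer the value the formula implies.
pooled_sd = sqrt((1 ** 2 + 3 ** 2) / 2)
ovl_unequal = NormalDist(0, 1).overlap(NormalDist(d * pooled_sd, 3))

print(idx)
print(f"overlap, equal SDs: {ovl_equal:.3f}; unequal SDs: {ovl_unequal:.3f}")
```

The two overlap values share the same d, so the gap between them shows why the usual reading of d can mislead once the equal-variance assumption fails.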

Citations: 0

Enhancing Effort-Moderated Item Response Theory Models by Evaluating a Two-Step Estimation Method and Multidimensional Variations on the Model.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-06 DOI: 10.1177/00131644241280727

Bowen Wang, Corinne Huggins-Manley, Huan Kuang, Jiawei Xiong

Rapid-guessing behavior in data can compromise our ability to estimate item and person parameters accurately. Consequently, it is crucial to model data with rapid-guessing patterns in a way that can produce unbiased ability estimates. This study proposes and evaluates three alternative modeling approaches that follow the logic of the effort-moderated item response theory model (EM-IRT) to analyze response data with rapid-guessing responses. One is the two-step EM-IRT model, which utilizes the item parameters estimated by respondents without rapid-guessing behavior and was initially proposed by Rios and Soland without further investigation. The other two models are effort-moderated multidimensional models (EM-MIRT), which we introduce in this study and vary as both between-item and within-item structures. The advantage of the EM-MIRT model is to account for the underlying relationship between rapid-guessing propensity and ability. The three models were compared with the traditional EM-IRT model regarding the accuracy of parameter recovery in various simulated conditions. Results demonstrated that the two-step EM-IRT and between-item EM-MIRT model consistently outperformed the traditional EM-IRT model under various conditions, with the two-step EM-IRT estimation generally delivering the best performance, especially for ability and item difficulty parameters estimation. In addition, different rapid-guessing patterns (i.e., difficulty-based, changing state, and decreasing effort) did not affect the performance of the two-step EM-IRT model. Overall, the findings suggest that the EM-IRT model with the two-step parameter estimation method can be applied in practice for estimating ability in the presence of rapid-guessing responses due to its accuracy and efficiency. 
The between-item EM-MIRT model can be used as an alternative model when there is no significant mean difference in the ability estimates between examinees who exhibit rapid-guessing behavior and those who do not.
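
The logic of an effort-moderated response function can be sketched as follows. This is a hedged illustration of one common formulation, not the models estimated in the study: it assumes a 2PL model under solution behavior, a chance-level success probability of 1 divided by the number of options under rapid guessing, and item parameters that are already known (as they would be after the first step of a two-step approach). All parameter values, the `n_options` argument, and the function names are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def em_irt_prob(theta, a, b, effortful, n_options=4):
    """P(correct): 2PL under solution behavior, chance level under rapid guessing."""
    return p_2pl(theta, a, b) if effortful else 1.0 / n_options


def log_likelihood(theta, responses, items, effort_flags, n_options=4):
    """Log-likelihood of one examinee's scored responses at ability theta."""
    total = 0.0
    for x, (a, b), effortful in zip(responses, items, effort_flags):
        p = em_irt_prob(theta, a, b, effortful, n_options)
        total += math.log(p if x == 1 else 1.0 - p)
    return total


# Hypothetical item parameters (a, b), scored responses, and effort flags.
items = [(1.0, -0.5), (1.3, 0.0), (0.8, 0.7), (1.1, 1.2)]
responses = [1, 1, 0, 0]
effort_flags = [True, True, True, False]  # the last response was a rapid guess

# Coarse grid search for the ability estimate (illustration only).
grid = [g / 10 for g in range(-40, 41)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, responses, items, effort_flags))
print(f"ability estimate with fixed item parameters: {theta_hat:.1f}")
```

Because the rapid-guess term does not depend on theta, it adds only a constant to the log-likelihood, so the flagged response has no influence on the ability estimate.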

Citations: 0

Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-03 DOI: 10.1177/00131644241281053

Hotaka Maeda

Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on GitHub: https://github.com/hotakamaeda/ai_field_testing1.
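
The evaluation metrics quoted in this abstract (correlation, bias, and root mean squared error between true and predicted 2PL probabilities) are easy to reproduce in form. The sketch below is not taken from the linked repository; it assumes the usual 2PL response function and uses made-up item parameters and abilities purely to show how the three numbers are computed.

```python
import math


def p_correct_2pl(theta, a, b):
    """2PL probability that an examinee with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def agreement(true_p, pred_p):
    """Pearson correlation, bias (mean error), and RMSE of predicted vs. true values."""
    n = len(true_p)
    errors = [p - t for t, p in zip(true_p, pred_p)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_t, mean_p = sum(true_p) / n, sum(pred_p) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_p, pred_p))
    var_t = sum((t - mean_t) ** 2 for t in true_p)
    var_p = sum((p - mean_p) ** 2 for p in pred_p)
    return cov / math.sqrt(var_t * var_p), bias, rmse


# Hypothetical abilities and (a, b) parameters for one item.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
true_item = (1.2, 0.3)  # parameters calibrated from human responses (assumed)
pred_item = (0.9, 0.0)  # parameters implied by a model's predictions (assumed)

true_p = [p_correct_2pl(t, *true_item) for t in thetas]
pred_p = [p_correct_2pl(t, *pred_item) for t in thetas]
r, bias, rmse = agreement(true_p, pred_p)
print(f"r = {r:.2f}, bias = {bias:.2f}, RMSE = {rmse:.2f}")
```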

Citations: 0

On the Benefits of Using Maximal Reliability in Educational and Behavioral Research.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-01 Epub Date: 2023-12-27 DOI: 10.1177/00131644231215771

Tenko Raykov

This note is concerned with the benefits that can result from the use of the maximal reliability and optimal linear combination concepts in educational and psychological research. Within the widely used framework of unidimensional multi-component measuring instruments, it is demonstrated that the linear combination of their components that possesses the highest possible reliability can exhibit a level of consistency considerably exceeding that of their overall sum score that is nearly routinely employed in contemporary empirical research. This optimal linear combination can be particularly useful in circumstances where one or more scale components are associated with relatively large error variances, but their removal from the instrument can lead to a notable loss in validity due to construct underrepresentation. The discussion is illustrated with a numerical example.
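
A small numerical sketch may help show why the optimal linear combination can beat the sum score. It assumes a unidimensional (congeneric) model with unit factor variance and uncorrelated errors, under which a composite with weights w has reliability (Σ wλ)² / [(Σ wλ)² + Σ w²θ], and the weights λ/θ maximize it. The loadings and error variances below are hypothetical and are not the numerical example from the note.

```python
def composite_reliability(weights, loadings, error_vars):
    """Reliability of a weighted composite under a one-factor model with unit factor variance."""
    true_part = sum(w * l for w, l in zip(weights, loadings)) ** 2
    error_part = sum(w * w * e for w, e in zip(weights, error_vars))
    return true_part / (true_part + error_part)


def optimal_weights(loadings, error_vars):
    """Weights proportional to loading / error variance maximize composite reliability."""
    return [l / e for l, e in zip(loadings, error_vars)]


# Hypothetical four-component scale; the last component is very noisy.
loadings = [0.8, 0.7, 0.6, 0.3]
error_vars = [0.36, 0.51, 0.64, 2.0]

sum_score_rel = composite_reliability([1.0] * 4, loadings, error_vars)
max_rel = composite_reliability(optimal_weights(loadings, error_vars),
                                loadings, error_vars)
print(f"sum score: {sum_score_rel:.3f}, optimal combination: {max_rel:.3f}")
```

The noisy component receives a small weight instead of being dropped, which is the situation the note highlights: the component stays in the instrument for validity reasons while contributing little error to the composite.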

Citations: 0

Enhancing Precision in Predicting Magnitude of Differential Item Functioning: An M-DIF Pretrained Model Approach.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-10-01 DOI: 10.1177/00131644241279882

Shan Huang, Hidetoki Ishii

Despite numerous studies on the magnitude of differential item functioning (DIF), different DIF detection methods often define effect sizes inconsistently and fail to adequately account for testing conditions. To address these limitations, this study introduces the unified M-DIF model, which defines the magnitude of DIF as the difference in item difficulty parameters between reference and focal groups. The M-DIF model can incorporate various DIF detection methods and test conditions to form a quantitative model. The pretrained approach was employed to leverage a sufficiently representative large sample as the training set and ensure the model's generalizability. Once the pretrained model is constructed, it can be directly applied to new data. Specifically, a training dataset comprising 144 combinations of test conditions and 144,000 potential DIF items, each equipped with 29 statistical metrics, was used. We adopt the XGBoost method for modeling. Results show that, based on root mean square error (RMSE) and BIAS metrics, the M-DIF model outperforms the baseline model in both validation sets: under consistent and inconsistent test conditions. Across all 360 combinations of test conditions (144 consistent and 216 inconsistent with the training set), the M-DIF model demonstrates lower RMSE in 357 cases (99.2%), illustrating its robustness. Finally, we provided an empirical example to showcase the practical feasibility of implementing the M-DIF model.

Citations: 0

Using ROC Analysis to Refine Cut Scores Following a Standard Setting Process.

IF 2.1 | Zone 3, Psychology | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-09-24 DOI: 10.1177/00131644241278925

Dongwei Wang, Lisa A Keller

In educational assessment, cut scores are often defined through standard setting by a group of subject matter experts. This study aims to investigate the impact of several factors on classification accuracy using the receiver operating characteristic (ROC) analysis to provide statistical and theoretical evidence when the cut score needs to be refined. Factors examined in the study include the sample distribution relative to the cut score, prevalence of the positive event, and cost ratio. Forty item responses were simulated for examinees of four sample distributions. In addition, the prevalence and cost ratio between false negatives and false positives were manipulated to examine their impacts on classification accuracy. The optimal cut score is identified using the Youden Index J. The results showed that the optimal cut score identified by the evaluation criterion tended to pull the cut score closer to the mode of the proficiency distribution. In addition, depending on the prevalence of the positive event and cost ratio, the optimal cut score shifts accordingly. With the item parameters used to simulate the data and the simulated sample distributions, it was found that when passing the exam is a low-prevalence event in the population, increasing the cut score operationally improves the classification; when passing the exam is a high-prevalence event, the cut score should be reduced to achieve optimality. As the cost ratio increases, the optimal cut score suggested by the evaluation criterion decreases. In three out of the four sample distributions examined in this study, increasing the cut score enhanced the classification, irrespective of the cost ratio when the prevalence in the population is 50%. This study provides statistical evidence when the cut score needs to be refined for policy reasons.
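
Selecting a cut score with the Youden Index can be sketched as below. The example assumes the standard definition J = sensitivity + specificity - 1 and treats a score at or above the cut as a predicted pass. The scores and mastery labels are invented for illustration, and the sketch leaves out the prevalence and cost-ratio adjustments examined in the study.

```python
def youden_j(scores, is_master, cut):
    """Youden Index J = sensitivity + specificity - 1 for a pass rule of score >= cut."""
    tp = sum(1 for s, m in zip(scores, is_master) if m and s >= cut)
    fn = sum(1 for s, m in zip(scores, is_master) if m and s < cut)
    tn = sum(1 for s, m in zip(scores, is_master) if not m and s < cut)
    fp = sum(1 for s, m in zip(scores, is_master) if not m and s >= cut)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_cut(scores, is_master, candidates):
    """Candidate cut score with the largest J (the first one wins on ties)."""
    return max(candidates, key=lambda c: youden_j(scores, is_master, c))


# Hypothetical raw scores on a 40-item test and true mastery status.
scores = [12, 15, 18, 20, 22, 24, 25, 27, 30, 33]
is_master = [False, False, False, True, False, True, True, True, True, True]

cut = best_cut(scores, is_master, range(10, 36))
print(f"cut score = {cut}, J = {youden_j(scores, is_master, cut):.3f}")
```

A panel-recommended cut score could be compared against the value found this way to judge whether a refinement is supported by the classification evidence.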

Citations: 0

Investigating the Ordering Structure of Clustered Items Using Nonparametric Item Response Theory

IF 2.7 Zone 3 Psychology Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

Educational and Psychological Measurement

Pub Date : 2024-09-06 DOI: 10.1177/00131644241274122

Letty Koopman, Johan Braeken

Educational and psychological tests with an ordered item structure enable efficient test administration procedures and allow for intuitive score interpretation and monitoring. The effectiveness of the measurement instrument relies to a large extent on the validated strength of its ordering structure. We define three increasingly strict types of ordering for the ordering structure of a measurement instrument with clustered items: a weak and a strong invariant cluster ordering and a clustered invariant item ordering. Following a nonparametric item response theory (IRT) approach, we proposed a procedure to evaluate the ordering structure of a clustered item set along this three-fold continuum of order invariance. The basis of the procedure is (a) the local assessment of pairwise conditional expectations at both cluster and item level and (b) the global assessment of the number of Guttman errors through new generalizations of the H-coefficient for this item-cluster context. The procedure, readily implemented in R, is illustrated and applied to an empirical example. Suggestions for test practice, further methodological developments, and future research are discussed.

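As background for the global assessment mentioned in the abstract, here is a minimal Python sketch of the classical building blocks: counting Guttman errors for dichotomous items and computing Loevinger's scalability coefficient H = 1 - F/E, where F is the observed and E the expected number of Guttman errors under marginal independence. This is the standard item-level coefficient, not the new item-cluster generalizations the authors propose, and it is unrelated to their R implementation; the function name `loevinger_h` and the toy data are hypothetical.

```python
from itertools import combinations


def loevinger_h(data):
    """Return (observed_errors, H) for a matrix of 0/1 item responses.

    data : list of rows, one row of 0/1 responses per respondent.
    For each item pair, a Guttman error is failing the easier (more popular)
    item while passing the harder one.
    Assumes at least one pair has a positive expected error count.
    """
    n = len(data)
    k = len(data[0])
    # Item popularities: proportion of respondents scoring 1 on each item.
    p = [sum(row[i] for row in data) / n for i in range(k)]
    observed = 0.0
    expected = 0.0
    for i, j in combinations(range(k), 2):
        easy, hard = (i, j) if p[i] >= p[j] else (j, i)
        observed += sum(1 for row in data if row[easy] == 0 and row[hard] == 1)
        # Expected errors if the two items were statistically independent.
        expected += n * (1 - p[easy]) * p[hard]
    return observed, 1 - observed / expected


# Hypothetical toy data: a perfect Guttman scale, then one with errors.
perfect = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
noisy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]

errors_perfect, h_perfect = loevinger_h(perfect)
errors_noisy, h_noisy = loevinger_h(noisy)
print(errors_perfect, h_perfect)
print(errors_noisy, h_noisy)
```

H equals 1 when no Guttman errors occur and drops toward 0 as the observed errors approach the number expected under independence.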


Citations: 0