
Latest Publications in Educational and Psychological Measurement

Invariance: What Does Measurement Invariance Allow Us to Claim?
IF 2.3 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-06-01 | Epub Date: 2024-10-28 | DOI: 10.1177/00131644241282982
John Protzko

Measurement involves numerous theoretical and empirical steps; ensuring that our measures operate the same way in different groups is one of them. Measurement invariance occurs when the factor loadings and item intercepts or thresholds of a scale operate similarly for people at the same level of the latent variable in different groups. This is commonly taken to mean that the scale is measuring the same thing in those groups. Here we test the assumption that measurement invariance implies common measurement by randomly assigning American adults (N = 1500) to fill out scales assessing either a coherent factor (search for meaning in life) or a nonsense factor measuring nothing. We find that a nonsense scale whose items measure nothing shows strong measurement invariance with the original scale, is reliable, and covaries with other constructs. We show that measurement invariance can occur without measurement. Thus, we cannot infer that measurement invariance means one is measuring the same thing; it may be a necessary but not a sufficient condition.
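
The reliability claim above is easy to illustrate: any set of items sharing even an arbitrary common source of variance will look internally consistent. A minimal numpy sketch (not the authors' analysis code; the six-item scale is illustrative) using the standard Cronbach's alpha formula:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_persons, n_items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
# Six "items" that share only an arbitrary common factor -- they need not
# measure anything substantive in order to covary and look reliable.
common = rng.normal(size=(1500, 1))
scores = common + rng.normal(size=(1500, 6))
alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # well above conventional reliability cutoffs
```

Reliability here reflects only shared variance, which is exactly why it cannot certify that anything in particular is being measured.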

Citations: 0
Evaluating the Performance of a Regularized Differential Item Functioning Method for Testlet-Based Polytomous Items.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-05-31 | DOI: 10.1177/00131644251342512
Jing Huang, M David Miller, Anne Corinne Huggins-Manley, Walter L Leite, Herman T Knopf, Albert D Ritzhaupt

This study investigated the effect of testlets on regularization-based differential item functioning (DIF) detection in polytomous items, focusing on the generalized partial credit model with lasso penalization (GPCMlasso) DIF method. Five factors were manipulated: sample size, magnitude of the testlet effect, magnitude of DIF, number of DIF items, and type of DIF-inducing covariates. Model performance was evaluated using the false-positive rate (FPR) and true-positive rate (TPR). Results showed that the FPR was effectively controlled across conditions, while the TPR was differentially influenced by the manipulated factors. In general, a small testlet effect did not noticeably affect the GPCMlasso model's FPR or TPR. The findings provide evidence of the effectiveness of the GPCMlasso method for DIF detection in polytomous items when testlets are present. Implications for future research and limitations are also discussed.
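
The FPR/TPR evaluation described above reduces to comparing item-level DIF flags against the simulated truth. A minimal sketch with hypothetical flags (not the study's simulation design):

```python
import numpy as np

def fpr_tpr(flagged: np.ndarray, true_dif: np.ndarray):
    """False-positive and true-positive rates for item-level DIF flags.

    flagged:  boolean array, item flagged as DIF by the detection method
    true_dif: boolean array, item simulated to actually contain DIF
    """
    flagged = np.asarray(flagged, bool)
    true_dif = np.asarray(true_dif, bool)
    fpr = float(flagged[~true_dif].mean())  # flags among DIF-free items
    tpr = float(flagged[true_dif].mean())   # flags among true DIF items
    return fpr, tpr

# Hypothetical six-item test: items 0-1 have simulated DIF.
true_dif = np.array([True, True, False, False, False, False])
flagged  = np.array([True, False, True, False, False, False])
print(fpr_tpr(flagged, true_dif))  # (0.25, 0.5)
```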

Citations: 0
Beta-Binomial Model for Count Data: An Application in Estimating Model-Based Oral Reading Fluency.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-05-30 | DOI: 10.1177/00131644251335914
Xin Qiao, Akihito Kamata, Yusuf Kara, Cornelis Potgieter, Joseph F T Nese

In this article, the beta-binomial model for count data is proposed and demonstrated through its application to oral reading fluency (ORF) assessment, where the number of words read correctly (WRC) is of interest. Existing studies adopted the binomial model for count data in similar assessment scenarios. The beta-binomial model, however, accounts for extra variability in count data that the binomial model neglects, and can therefore accommodate potential overdispersion. To estimate model-based ORF scores, WRC and response times were jointly modeled. The full Bayesian Markov chain Monte Carlo method was adopted for model parameter estimation. A simulation study showed adequate parameter recovery for the beta-binomial model and evaluated the performance of model fit indices in selecting the true data-generating models. Further, an empirical analysis illustrated the application of the proposed model using a dataset from a computerized ORF assessment. The findings were consistent with the simulation study and demonstrated the utility of adopting the beta-binomial model for count-type item responses from assessment data.
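
The overdispersion point can be made concrete by simulation: drawing each student's success probability from a Beta distribution before the binomial draw inflates the variance of words read correctly while leaving the mean unchanged. A numpy sketch with illustrative values (the passage length and Beta parameters are assumptions, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, a, b = 60, 9, 1   # hypothetical passage length; Beta mean a/(a+b) = 0.9
n_students = 100_000

# Binomial: every student shares one success probability.
binom_wrc = rng.binomial(n_words, a / (a + b), size=n_students)

# Beta-binomial: each student first draws an individual probability.
p_i = rng.beta(a, b, size=n_students)
betabinom_wrc = rng.binomial(n_words, p_i)

print(binom_wrc.mean(), betabinom_wrc.mean())  # both near 54
print(binom_wrc.var(), betabinom_wrc.var())    # beta-binomial variance is much larger
```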

Citations: 0
Bayesian Thurstonian IRT Modeling: Logical Dependencies as an Accurate Reflection of Thurstone's Law of Comparative Judgment.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-05-30 | DOI: 10.1177/00131644251335586
Hannah Heister, Philipp Doebler, Susanne Frick

Thurstonian item response theory (Thurstonian IRT) is a well-established approach to latent trait estimation with forced choice data of arbitrary block lengths. In the forced choice format, test takers rank statements within each block, and the ranks are coded with binary variables. Since each rank is awarded exactly once per block, stochastic dependencies arise: for example, when options A and B have ranks 1 and 3 in a block of length 3, C must have rank 2. Although the original implementation of the Thurstonian IRT model can recover parameters well, it is not completely true to the mathematical model and Thurstone's law of comparative judgment, as impossible binary answer patterns have a positive probability. We refer to this problem as stochastic dependencies; it is due to unconstrained item intercepts. In addition, there are redundant binary comparisons resulting in what we call logical dependencies: for example, if within a block A < B and B < C, then A < C must follow, and a binary variable for A < C is not needed. Since current Markov chain Monte Carlo approaches to Bayesian computation are flexible and at the same time promise correct small-sample inference, we investigate an alternative Bayesian implementation of the Thurstonian IRT model that considers both stochastic and logical dependencies. We show analytically that the same parameters maximize the posterior likelihood, regardless of the presence or absence of redundant binary comparisons. A comparative simulation reveals a large reduction in computational effort for the alternative implementation, due to respecting both dependencies. This investigation therefore suggests that all dependencies should be considered when fitting the Thurstonian IRT model.
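
The binary coding and the logical dependency can be sketched in a few lines (a toy illustration, not the authors' implementation): a block ranking is expanded into pairwise comparisons, and one of the three binaries is implied by the other two.

```python
from itertools import combinations

def pairwise_binaries(ranks: dict) -> dict:
    """Code a within-block ranking (1 = highest) as binary pairwise
    outcomes: 1 if the first option is ranked above the second."""
    return {(i, j): int(ranks[i] < ranks[j])
            for i, j in combinations(sorted(ranks), 2)}

block = {"A": 1, "B": 2, "C": 3}  # A ranked above B, B above C
y = pairwise_binaries(block)
print(y)  # {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 1}
# ('A', 'C') is logically redundant: A < B and B < C already imply A < C.
```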

Citations: 0
Using Biclustering to Detect Cheating in Real Time on Mixed-Format Tests.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-05-24 | DOI: 10.1177/00131644251333143
Hyeryung Lee, Walter P Vispoel

We evaluated a real-time biclustering method for detecting cheating on mixed-format assessments that included dichotomous, polytomous, and multi-part items. Biclustering jointly groups examinees and items by identifying subgroups of test takers who exhibit similar response patterns on specific subsets of items. This method's flexibility and minimal assumptions about examinee behavior make it computationally efficient and highly adaptable. To further fine-tune accuracy and reduce false positives in real-time detection, enhanced statistical significance tests were incorporated into the illustrated algorithms. Two simulation studies were conducted to assess detection across varying testing conditions. In the first study, the method effectively detected cheating on tests composed entirely of either dichotomous or non-dichotomous items. In the second study, we examined tests with varying mixed item formats and again observed strong detection performance. In both studies, detection performance was examined at each timestamp in real time and evaluated under three varying conditions: proportion of cheaters, cheating group size, and proportion of compromised items. Across conditions, the method demonstrated strong computational efficiency, underscoring its suitability for real-time applications. Overall, these results highlight the adaptability, versatility, and effectiveness of biclustering in detecting cheating in real time while maintaining low false-positive rates.
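
As a rough illustration of the underlying idea, rather than the article's biclustering algorithm, one can look for subgroups of examinees who share an identical response pattern on a subset of items (the data and the planted trio below are hypothetical):

```python
import numpy as np

def matching_groups(responses: np.ndarray, item_subset):
    """Group examinees by their exact response pattern on a subset of
    items; unusually large groups may indicate shared (copied) answers."""
    patterns = {}
    for person, row in enumerate(responses[:, item_subset]):
        patterns.setdefault(tuple(int(v) for v in row), []).append(person)
    return {p: g for p, g in patterns.items() if len(g) > 1}

rng = np.random.default_rng(1)
R = rng.integers(0, 4, size=(8, 6))  # 8 examinees, 6 four-option items
R[[2, 5, 7], :4] = [3, 1, 0, 2]      # planted: three examinees share answers on items 0-3
groups = matching_groups(R, [0, 1, 2, 3])
print(groups)  # the planted trio surfaces as one group
```

Real biclustering additionally searches over item subsets and examinee subgroups jointly; this sketch only shows why a shared pattern on a specific item subset is the signal of interest.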

Citations: 0
Using Deep Reinforcement Learning to Decide Test Length.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-05-03 | DOI: 10.1177/00131644251332972
James Zoucha, Igor Himelfarb, Nai-En Tang

This study explored the application of deep reinforcement learning (DRL) as an innovative approach to optimizing test length. The primary focus was to evaluate whether the current length of the National Board of Chiropractic Examiners Part I Exam is justified. Modeling the problem as a combinatorial optimization task within a Markov decision process framework, we used an algorithm capable of constructing test forms from a finite set of items while adhering to critical structural constraints, such as content representation and item difficulty distribution. The findings reveal that although the DRL algorithm succeeded in identifying shorter test forms that maintained comparable ability-estimation accuracy, the existing test length of 240 items remains advisable because the shorter forms did not maintain the structural constraints. Furthermore, the study highlighted the inherent adaptability of DRL, which continuously learns about a test-taker's latent abilities and dynamically adjusts to their response patterns, making it well suited for personalized testing environments. This dynamic capability supports real-time decision-making in item selection, improving both efficiency and precision in ability estimation. Future research should focus on expanding the item bank and leveraging advanced computational resources to enhance the algorithm's search capacity for shorter, structurally compliant test forms.
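
The structural constraints described above (content representation, item difficulty) can be illustrated with a simple greedy baseline. This is a hypothetical sketch of the constraint structure only, not the authors' DRL algorithm, and the 600-item bank, content areas, and 20/20/20 blueprint are invented for illustration:

```python
import random
from collections import Counter

random.seed(0)
# Hypothetical item bank: each item has a content area and a difficulty.
bank = [{"id": i,
         "area": random.choice(["A", "B", "C"]),
         "diff": random.uniform(-2.0, 2.0)} for i in range(600)]
blueprint = {"A": 20, "B": 20, "C": 20}  # required items per content area

def greedy_form(bank, blueprint, target_diff=0.0):
    """Fill each content-area quota with the items closest to a target difficulty."""
    form = []
    for area, need in blueprint.items():
        pool = sorted((it for it in bank if it["area"] == area),
                      key=lambda it: abs(it["diff"] - target_diff))
        form.extend(pool[:need])
    return form

form = greedy_form(bank, blueprint)
print(len(form), Counter(it["area"] for it in form))  # 60 items, 20 per area
```

A DRL agent replaces the greedy rule with a learned item-selection policy, but any candidate form still has to satisfy blueprint constraints like these.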

Citations: 0
Evaluating Change in Adjusted R-Square and R-Square Indices: A Latent Variable Method Application.
IF 2.1 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-04-11 | DOI: 10.1177/00131644251329178
Tenko Raykov, Christine DiStefano

A procedure for interval estimation of the difference in the adjusted R-square index for nested linear models is discussed. The method yields as a byproduct confidence intervals for their standard R-square difference, as well as for the adjusted and standard R-squares associated with each model. The resulting interval estimate of the difference in adjusted R-square represents a useful and informative complement to the commonly used R-square change statistic and its significance test in model selection and contains substantially more information than that test. The outlined procedure is readily employed with popular software in empirical educational and psychological studies and is illustrated with numerical data.
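
The adjusted R-square underlying the procedure follows the standard formula R̄² = 1 − (1 − R²)(n − 1)/(n − p − 1). A quick sketch of the point estimate of the difference for two nested models (illustrative numbers; the article's contribution, the interval estimate, is not reproduced here):

```python
def adj_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-square for a model with p predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Nested models: two added predictors raise R-square from .30 to .33.
n = 120
delta_adj = adj_r2(0.33, n, 5) - adj_r2(0.30, n, 3)
print(round(delta_adj, 4))  # 0.0187, smaller than the raw change of 0.03
```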

Citations: 0
Differential Item Functioning Effect Size Use for Validity Information.
IF 2.3 | Psychology (CAS Tier 3) | Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-04-01 | Epub Date: 2024-11-22 | DOI: 10.1177/00131644241293694
W Holmes Finch, Maria Dolores Hidalgo Montesinos, Brian F French, Maria Hernandez Finch

There has been an emphasis on effect sizes for differential item functioning (DIF), with the purpose of understanding the magnitude of the differences detected through statistical significance testing. Several different effect sizes have been suggested, corresponding to the method used for analysis, as have different guidelines for interpretation. The purpose of this simulation study was to compare the performance of the described DIF effect size measures for quantifying and comparing the amount of DIF in two assessments. Several factors were manipulated that were thought to influence the effect sizes or are known to influence DIF detection. This study asked two questions. First, do the effect sizes accurately capture aggregate DIF across items? Second, do effect sizes accurately identify which assessment has the least amount of DIF? We highlight effect sizes that performed well across several simulated conditions and apply them to a real data set as an example. Results revealed that the mean log odds ratio of fixed effects (Ln OR̄_FE) and the variance of the Mantel-Haenszel log odds ratio (τ̂²) were most accurate for identifying which test contains more DIF. We point to future directions for this work to support the continued focus on effect sizes for understanding DIF magnitude.
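
The Mantel-Haenszel common odds ratio referenced above pools 2×2 (group × correct/incorrect) tables across matched score strata. A minimal sketch of its log-scale point estimate with hypothetical counts (the study's variance component τ̂² is not computed here):

```python
from math import log

def mh_log_odds(tables):
    """Mantel-Haenszel common odds ratio on the log scale.

    Each 2x2 table: ((ref_correct, ref_incorrect),
                     (focal_correct, focal_incorrect))."""
    num = sum(a * d / (a + b + c + d) for (a, b), (c, d) in tables)
    den = sum(b * c / (a + b + c + d) for (a, b), (c, d) in tables)
    return log(num / den)

# Two matched score strata for one item (hypothetical counts).
tables = [((30, 10), (20, 20)), ((50, 5), (40, 15))]
print(round(mh_log_odds(tables), 3))  # positive: the item favors the reference group
```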

Citations: 0
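The Mantel-Haenszel log odds ratio that the abstract above highlights pools 2×2 correct/incorrect tables across ability strata. A minimal sketch of that computation follows; the strata counts are invented for illustration and are not data from the study.

```python
import math

def mh_log_odds_ratio(tables):
    """Mantel-Haenszel common log odds ratio across ability strata.

    Each table is (a, b, c, d):
      a = reference-group correct,   b = reference-group incorrect,
      c = focal-group correct,       d = focal-group incorrect.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return math.log(num / den)

# Three ability strata; the reference group answers correctly more often
# at every level, so the pooled log odds ratio comes out positive.
strata = [(40, 10, 30, 20), (30, 20, 20, 30), (20, 30, 10, 40)]
print(round(mh_log_odds_ratio(strata), 3))  # 0.916
```

A value near zero indicates no DIF on the item; the variance of this statistic across a test's items is what the abstract's τ̂² summarizes.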
Field-Testing Multiple-Choice Questions With AI Examinees: English Grammar Items.
IF 2.3 Tier 3 Psychology Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-04-01 Epub Date: 2024-10-03 DOI: 10.1177/00131644241281053
Hotaka Maeda

Field-testing is an essential yet often resource-intensive step in the development of high-quality educational assessments. I introduce an innovative method for field-testing newly written exam items by substituting human examinees with artificially intelligent (AI) examinees. The proposed approach is demonstrated using 466 four-option multiple-choice English grammar questions. Pre-trained transformer language models are fine-tuned based on the 2-parameter logistic (2PL) item response model to respond like human test-takers. Each AI examinee is associated with a latent ability θ, and the item text is used to predict response selection probabilities for each of the four response options. For the best modeling approach identified, the overall correlation between the true and predicted 2PL correct response probabilities was .82 (bias = 0.00, root mean squared error = 0.18). The study results were promising, showing that item response data generated from AI can be used to calculate item proportion correct, item discrimination, conduct item calibration with anchors, distractor analysis, dimensionality analysis, and latent trait scoring. However, the proposed approach did not achieve the level of accuracy obtainable with human examinee response data. If further refined, potential resource savings in transitioning from human to AI field-testing could be enormous. AI could shorten the field-testing timeline, prevent examinees from seeing low-quality field-test items in real exams, shorten test lengths, eliminate test security, item exposure, and sample size concerns, reduce overall cost, and help expand the item bank. Example Python code from this study is available on Github: https://github.com/hotakamaeda/ai_field_testing1.

Citations: 0
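The 2-parameter logistic (2PL) model underlying the study above gives each examinee an ability θ and each item a discrimination a and difficulty b. A minimal sketch of the 2PL correct-response probability and a simulated response pattern follows; the item parameters are illustrative, not from the study's item bank.

```python
import math
import random

def p_correct_2pl(theta, a, b):
    """2PL probability of a correct response for ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(theta, items, rng):
    """Draw one 0/1 response per (a, b) item for an examinee at theta."""
    return [int(rng.random() < p_correct_2pl(theta, a, b)) for a, b in items]

rng = random.Random(0)
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # illustrative (a, b) pairs
# An average examinee (theta = 0) has better-than-even odds only on
# items easier than their ability (b < 0).
print([round(p_correct_2pl(0.0, a, b), 2) for a, b in items])
print(simulate_responses(0.0, items, rng))
```

Fitting an AI examinee into this framework, as the abstract describes, amounts to associating each model with a θ and checking that its predicted response-selection probabilities track these 2PL curves.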
Assessing the Speed-Accuracy Tradeoff in Psychological Testing Using Experimental Manipulations.
IF 2.3 Tier 3 Psychology Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS Pub Date : 2025-04-01 Epub Date: 2024-10-07 DOI: 10.1177/00131644241271309
Tobias Alfers, Georg Gittler, Esther Ulitzsch, Steffi Pohl

The speed-accuracy tradeoff (SAT), where increased response speed often leads to decreased accuracy, is well established in experimental psychology. However, its implications for psychological assessments, especially in high-stakes settings, remain less understood. This study presents an experimental approach to investigate the SAT within a high-stakes spatial ability assessment. By manipulating instructions in a within-subjects design to induce speed variations in a large sample (N = 1,305) of applicants for an air traffic controller training program, we demonstrate the feasibility of manipulating working speed. Our findings confirm the presence of the SAT for most participants, suggesting that traditional ability scores may not fully reflect performance in high-stakes assessments. Importantly, we observed individual differences in the SAT, challenging the assumption of uniform SAT functions across test takers. These results highlight the complexity of interpreting high-stakes assessment outcomes and the influence of test conditions on performance dynamics. This study offers a valuable addition to the methodological toolkit for assessing the intraindividual relationship between speed and accuracy in psychological testing (including SAT research), providing a controlled approach while acknowledging the need to address potential confounders. Future research may apply this method across various cognitive domains, populations, and testing contexts to deepen our understanding of the SAT's broader implications for psychological measurement.

Citations: 0
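The within-subjects design above induces a speeded and a standard condition for each examinee, and the individual differences it reports amount to per-person tradeoff estimates. A minimal sketch of one such summary follows; the (response time, accuracy) pairs are invented for illustration, and this simple ratio is only one of many ways to quantify the tradeoff.

```python
def sat_effect(speeded, standard):
    """Within-person speed-accuracy tradeoff summary.

    Each condition is (mean_response_time_sec, proportion_correct).
    Returns the accuracy lost per second of time saved: a positive
    value means the examinee traded accuracy for speed.
    """
    rt_s, acc_s = speeded
    rt_n, acc_n = standard
    time_saved = rt_n - rt_s
    if time_saved <= 0:
        return 0.0  # no speed-up was induced, so no tradeoff to estimate
    return (acc_n - acc_s) / time_saved

# Two illustrative examinees: one shows a clear tradeoff, one does not.
print(round(sat_effect(speeded=(4.0, 0.70), standard=(6.0, 0.84)), 3))  # 0.07
print(sat_effect(speeded=(5.5, 0.82), standard=(6.0, 0.82)))            # 0.0
```

Heterogeneity in these per-person values across a sample is exactly what challenges the assumption of a uniform SAT function across test takers.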