
Latest Publications in Educational and Psychological Measurement

The One-Parameter Logistic Model Can Be True With Zero Probability for a Unidimensional Measuring Instrument: How One Could Go Wrong Removing Items Not Satisfying the Model.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-08-06 · DOI: 10.1177/00131644251345120
Tenko Raykov, Bingsheng Zhang

This note is concerned with the chance of the one-parameter logistic (1PL-) model or the Rasch model being true for a unidimensional multi-item measuring instrument. It is pointed out that if a single dimension underlies a scale consisting of dichotomous items, then the probability of either model being correct for that scale can be zero. The question is then addressed of what the consequences could be of removing items that do not follow these models. Using a large number of simulated data sets, a pair of empirically relevant settings is presented where such item elimination can be problematic. Specifically, dropping items from a unidimensional instrument because they do not satisfy the 1PL-model, or the Rasch model, can yield potentially seriously misleading ability estimates with increased standard errors and prediction error with respect to the latent trait. Implications for educational and behavioral research are discussed.
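To make the central point concrete, the sketch below (a minimal illustration, not the authors' simulation design; the 2PL generating model, the rest-score ability proxy, and all parameter values are assumptions) generates strictly unidimensional data whose items nonetheless carry unequal discriminations, so no common-slope 1PL or Rasch model can hold:

```python
# Minimal sketch: a unidimensional scale whose items follow a 2PL model with
# unequal discriminations is strictly unidimensional yet violates 1PL/Rasch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_persons, n_items = 5000, 8
theta = rng.normal(size=n_persons)                  # single latent dimension
a = np.linspace(0.5, 2.5, n_items)                  # unequal discriminations
b = rng.normal(size=n_items)                        # difficulties
p = 1 / (1 + np.exp(-(a * (theta[:, None] - b))))   # 2PL response probabilities
x = rng.binomial(1, p)                              # dichotomous responses

# Proxy for ability: standardized rest score (total minus the item itself).
for j in range(n_items):
    rest = x.sum(axis=1) - x[:, j]
    z = (rest - rest.mean()) / rest.std()
    slope = LogisticRegression().fit(z[:, None], x[:, j]).coef_[0, 0]
    print(f"item {j}: estimated slope {slope:.2f}")
# The estimated slopes differ systematically, so a model that forces one
# common discrimination (1PL/Rasch) cannot hold despite unidimensionality.
```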

Citations: 0
Model-Based Person Fit Statistics Applied to the Wechsler Adult Intelligence Scale IV.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-08-03 · DOI: 10.1177/00131644251339444
Jared M Block, Steven P Reise, Keith F Widaman, Amanda K Montoya, David W Loring, Laura Glass Umfleet, Russell M Bauer, Joseph M Gullett, Brittany Wolff, Daniel L Drane, Kristen Enriquez, Robert M Bilder

An important task in clinical neuropsychology is to evaluate whether scores obtained on a test battery, such as the Wechsler Adult Intelligence Scale Fourth Edition (WAIS-IV), can be considered "credible" or "valid" for a particular patient. Such evaluations are typically made based on responses to performance validity tests (PVTs). As a complement to PVTs, we propose that WAIS-IV profiles also be evaluated using a residual-based M-distance ($d_{ri}^{2}$) person fit statistic. Large $d_{ri}^{2}$ values flag profiles that are inconsistent with the factor analytic model underlying the interpretation of test scores. We first established a well-fitting model with four correlated factors for 10 core WAIS-IV subtests derived from the standardization sample. Based on this model, we then performed a Monte Carlo simulation to evaluate whether a hypothesized sampling distribution for $d_{ri}^{2}$ was accurate and whether $d_{ri}^{2}$ was computable, under different degrees of missing subtest scores. We found that when the number of subtests administered was less than 8, $d_{ri}^{2}$ could not be computed around 25% of the time. When computable, $d_{ri}^{2}$ conformed to a $\chi^{2}$ distribution with degrees of freedom equal to the number of tests minus the number of factors. Demonstration of the $d_{ri}^{2}$ index in a large sample of clinical cases was also provided. Findings highlight the potential utility of the $d_{ri}^{2}$ index as an adjunct to PVTs, offering clinicians an additional method to evaluate WAIS-IV test profiles and improve the accuracy of neuropsychological evaluations.
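For the mechanics, here is a schematic Python sketch of a residual-based M-distance person-fit computation, assuming Bartlett-type factor scores and invented loadings (the article's WAIS-IV estimates are not reproduced here). Under the fitted model, the quadratic form should track the $\chi^2$ reference with degrees of freedom equal to subtests minus factors described in the abstract:

```python
# Schematic sketch of a residual-based M-distance person-fit statistic with
# Bartlett factor scores and a chi-square reference, df = subtests - factors.
# Loadings and uniquenesses are made up for illustration, not WAIS-IV values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
p, m, n = 10, 4, 2000                              # subtests, factors, persons
L = np.zeros((p, m))
for j in range(p):                                 # simple-structure loadings
    L[j, j % m] = 0.8
Phi = 0.5 * np.ones((m, m)) + 0.5 * np.eye(m)      # correlated factors
Psi = np.diag(0.36 * np.ones(p))                   # uniquenesses (1 - 0.8**2)

eta = rng.multivariate_normal(np.zeros(m), Phi, size=n)
y = eta @ L.T + rng.normal(scale=np.sqrt(np.diag(Psi)), size=(n, p))

Psi_inv = np.linalg.inv(Psi)
B = np.linalg.solve(L.T @ Psi_inv @ L, L.T @ Psi_inv)   # Bartlett weights
resid = y - (B @ y.T).T @ L.T                           # y minus model-implied
d2 = np.einsum("ij,jk,ik->i", resid, Psi_inv, resid)    # per-person fit value

df = p - m
flagged = d2 > stats.chi2.ppf(0.99, df)                 # e.g., 1% flag rate
print(f"df={df}, mean d2={d2.mean():.2f}, flagged {flagged.mean():.1%}")
```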

Citations: 0
Disentangling Qualitatively Different Faking Strategies in High-Stakes Personality Assessments: A Mixture Extension of the Multidimensional Nominal Response Model.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-07-29 · DOI: 10.1177/00131644251341843
Timo Seitz, Ö Emre C Alagöz, Thorsten Meiser

High-stakes personality assessments are often compromised by faking, where test-takers distort their responses according to social desirability. Many previous models have accounted for faking by modeling an additional latent dimension that quantifies each test-taker's degree of faking. Such models assume a homogeneous response strategy among all test-takers, reflected in a measurement model in which substantive traits and faking jointly influence item responses. However, such a model will be misspecified if, for some test-takers, item responding is a function of substantive traits only or of faking only. To address this limitation, we propose a mixture modeling extension of the multidimensional nominal response model (M-MNRM) that can be used to account for qualitatively different response strategies and to model relationships of strategy use with external variables. In a simulation study, the M-MNRM exhibited good parameter recovery and high classification accuracy across multiple conditions. Analyses of three empirical high-stakes datasets provided evidence for the consistent presence of the specified latent classes in different personnel selection contexts, emphasizing the importance of accounting for this kind of response-behavior heterogeneity in high-stakes assessment data. We end the article with a discussion of the model's utility for psychological measurement.
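A toy illustration of the mixture idea follows, with invented item parameters and plug-in latent values (a full M-MNRM would integrate over the latent variables and estimate all parameters jointly):

```python
# Schematic sketch: two latent classes with qualitatively different response
# processes for one polytomous (nominal) item. Class 0 responds from the
# substantive trait only; class 1 responds from a faking dimension only.
# All parameter values are invented for illustration.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

c = np.array([0.0, 0.3, 0.1, -0.4])          # category intercepts
a_trait = np.array([-1.5, -0.5, 0.5, 1.5])   # slopes on the substantive trait
a_fake = np.array([-2.0, -1.0, 1.0, 2.0])    # slopes on the faking dimension

def p_categories(theta, fake, cls):
    """Category probabilities under the class-specific measurement model."""
    if cls == 0:                              # trait-driven responding
        return softmax(c + a_trait * theta)
    return softmax(c + a_fake * fake)         # faking-driven responding

# Posterior class probability for an observed category k at plug-in latent
# values (a full treatment would integrate over theta and the faking factor).
pi = np.array([0.7, 0.3])                     # class mixing proportions
theta, fake, k = 0.0, 1.5, 3
lik = np.array([p_categories(theta, fake, cls)[k] for cls in (0, 1)])
post = pi * lik / (pi * lik).sum()
print(f"P(class | response={k}) = {post.round(3)}")
```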

Citations: 0
Item Difficulty Modeling Using Fine-tuned Small and Large Language Models.
IF 2.1 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-07-06 · DOI: 10.1177/00131644251344973
Ming Li, Hong Jiao, Tianyi Zhou, Nan Zhang, Sydney Peters, Robert W Lissitz

This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models (LLMs). We introduce novel data augmentation strategies, including augmentation on the fly and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models (SLMs) such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa yielded lower root mean squared error than the first-place model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. Majority voting among SLMs enhanced prediction accuracy, reinforcing the benefits of ensemble learning. LLMs, such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.
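As a rough sketch of the SLM fine-tuning setup, the following uses the Hugging Face transformers regression head on a three-item toy dataset; the base model, column names, and hyperparameters are placeholders rather than the authors' configuration:

```python
# Minimal sketch of fine-tuning a small language model for item-difficulty
# regression. Model choice, data, and hyperparameters are illustrative only.
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

items = Dataset.from_dict({
    "text": ["Solve 2x + 3 = 7.", "Define osmosis.", "Integrate x^2 dx."],
    "labels": [-0.8, 0.1, 0.9],          # difficulty on some calibrated scale
})

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
items = items.map(lambda b: tok(b["text"], truncation=True,
                                padding="max_length", max_length=64),
                  batched=True)

# num_labels=1 with problem_type="regression" gives an MSE training objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="difficulty-model", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=items,
)
trainer.train()

preds = trainer.predict(items).predictions.squeeze(-1)
rmse = float(np.sqrt(np.mean((preds - np.array(items["labels"])) ** 2)))
print(f"train RMSE: {rmse:.3f}")
```

Ensembling the predictions of several such fine-tuned models (the "majority voting" the abstract describes) would simply average or vote over their per-item outputs before computing the error.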

Citations: 0
Historical Measurement Information Can Be Used to Improve Estimation of Structural Parameters in Structural Equation Models With Small Samples.
IF 2.1 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-13 · DOI: 10.1177/00131644251330851
James Ohisei Uanhoro, Olushola O Soyoye

This study investigates the incorporation of historical measurement information into structural equation models (SEM) with small samples to enhance the estimation of structural parameters. Given the availability of published factor analysis results with loading estimates and standard errors for popular scales, researchers may use this historical information as informative priors in Bayesian SEM (BSEM). We focus on estimating the correlation between two constructs using BSEM after generating data with significant bias in the Pearson correlation of their sum scores due to measurement error. Our findings indicate that incorporating historical information on measurement parameters as priors can improve the accuracy of correlation estimates, mainly when the true correlation is small, a common scenario in psychological research. Priors derived from meta-analytic estimates were especially effective, providing high accuracy and acceptable coverage. However, when the true correlation is large, weakly informative priors on all parameters yield the best results. These results suggest leveraging historical measurement information in BSEM can enhance structural parameter estimation.
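The bias the authors build into their generating model is the classic attenuation of sum-score correlations by measurement error; a minimal simulation (with illustrative loadings and scale lengths, not the study's design) reproduces the effect that informative priors in BSEM are meant to correct:

```python
# Minimal sketch: measurement error attenuates the Pearson correlation
# between sum scores, so the observed correlation understates the latent one.
import numpy as np

rng = np.random.default_rng(11)
n, k, rho = 10_000, 6, 0.30                       # persons, items/scale, true r
cov = np.array([[1.0, rho], [rho, 1.0]])
f = rng.multivariate_normal([0, 0], cov, size=n)  # two latent constructs

lam = 0.6                                         # common loading, illustrative
x = lam * f[:, [0]] + rng.normal(scale=np.sqrt(1 - lam**2), size=(n, k))
y = lam * f[:, [1]] + rng.normal(scale=np.sqrt(1 - lam**2), size=(n, k))

r_obs = np.corrcoef(x.sum(axis=1), y.sum(axis=1))[0, 1]
# Sum-score reliability: Var(signal) / Var(total) for k parallel items.
rel = (k**2 * lam**2) / (k**2 * lam**2 + k * (1 - lam**2))
print(f"observed r = {r_obs:.3f}, attenuation prediction = {rho * rel:.3f}")
# BSEM with informative priors on the loadings aims to recover rho itself.
```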

Citations: 0
The Effect of Modeling Missing Data With IRTree Approach on Parameter Estimates Under Different Simulation Conditions.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-01 (Epub 2024-12-23) · DOI: 10.1177/00131644241306024
Yeşim Beril Soğuksu, Ergül Demir

This study explores the performance of the item response tree (IRTree) approach in modeling missing data, comparing it to the expectation-maximization (EM) algorithm and multiple imputation (MI) methods. Both simulation and empirical data were used to evaluate these methods across different missing-data mechanisms, test lengths, sample sizes, and missing-data proportions. Expected a posteriori estimation was used for ability estimation, and bias and root mean square error (RMSE) were calculated. The findings indicate that IRTree provides more accurate ability estimates with lower RMSE than both EM and MI methods. Its overall performance was particularly strong under missing completely at random and missing not at random, especially with longer tests and lower proportions of missing data. However, IRTree was most effective with moderate levels of omitted responses and medium-ability test takers, though its accuracy decreased in cases of extreme omissions and abilities. The study highlights that IRTree is particularly well suited for low-stakes tests and has strong potential for providing deeper insights into the underlying missing data mechanisms within a data set.
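The core IRTree device for omissions is a recoding of each item into two pseudo-items; the minimal sketch below shows only that recoding step (the subsequent IRT estimation on the expanded matrix is omitted):

```python
# Minimal sketch of the standard two-node IRTree recoding for omissions:
# node 1 asks "did the examinee respond?", node 2 asks "was the response
# correct?" and is missing whenever there was no response.
import numpy as np

data = np.array([[1.0, 0.0, np.nan],     # person 1: correct, wrong, omitted
                 [np.nan, 1.0, 1.0]])    # person 2: omitted, correct, correct

def irtree_expand(x):
    """Expand an (n_persons, n_items) matrix with NaN-coded omissions into
    node-1 (response/omission) and node-2 (correct/incorrect) pseudo-items."""
    node1 = np.where(np.isnan(x), 0.0, 1.0)       # 1 = responded
    node2 = np.where(np.isnan(x), np.nan, x)      # correctness, else missing
    return np.concatenate([node1, node2], axis=1)

print(irtree_expand(data))   # columns: 3 response nodes, then 3 correctness nodes
```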

Citations: 0
Item Classification by Difficulty Using Functional Principal Component Clustering and Neural Networks.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-01 (Epub 2025-01-04) · DOI: 10.1177/00131644241299834
James Zoucha, Igor Himelfarb, Nai-En Tang

Maintaining consistent item difficulty across test forms is crucial for accurately and fairly classifying examinees into pass or fail categories. This article presents a practical procedure for classifying items based on difficulty levels using functional data analysis (FDA). Methodologically, we clustered item characteristic curves (ICCs) into difficulty groups by analyzing their functional principal components (FPCs) and then employed a neural network to predict difficulty for ICCs. Given the degree of similarity between many ICCs, categorizing items by difficulty can be challenging. The strength of this method lies in its ability to provide an empirical and consistent process for item classification, as opposed to relying solely on visual inspection. The findings reveal that most discrepancies between visual classification and FDA results differed by only one adjacent difficulty level. Approximately 67% of these discrepancies involved items in the medium to hard range being categorized into higher difficulty levels by FDA, while the remaining third involved very easy to easy items being classified into lower levels. The neural network, trained on these data, achieved an accuracy of 79.6%, with misclassifications also differing by only one adjacent difficulty level compared to FDA clustering. The method demonstrates an efficient and practical procedure for classifying test items, especially beneficial in testing programs where smaller volumes of examinees tested at various times throughout the year.
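A rough analogue of the pipeline follows, with ordinary PCA on densely sampled curves standing in for functional PCA and all generating parameters invented:

```python
# Rough analogue of the article's pipeline: sample item characteristic curves
# on a theta grid, extract leading component scores (PCA as a stand-in for
# functional PCA), cluster by difficulty, and train a small neural network
# to predict the cluster of new curves. Parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
theta = np.linspace(-4, 4, 81)
a = rng.uniform(0.8, 2.0, size=300)               # discriminations
b = rng.normal(0.0, 1.2, size=300)                # difficulties
iccs = 1 / (1 + np.exp(-a[:, None] * (theta - b[:, None])))   # 2PL ICCs

fpc = PCA(n_components=3).fit(iccs)               # "functional" components
scores = fpc.transform(iccs)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(scores[:250], labels[:250])
print("holdout accuracy:", clf.score(scores[250:], labels[250:]))
```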

Citations: 0
The Impact of Missing Data on Parameter Estimation: Three Examples in Computerized Adaptive Testing.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-01 (Epub 2025-01-07) · DOI: 10.1177/00131644241306990
Xiaowen Liu, Eric Loken

In computerized adaptive testing (CAT), examinees see items targeted to their ability level. Postoperational data have a high degree of missing information relative to designs where everyone answers all questions. Item responses are observed over a restricted range of abilities, reducing item-total score correlations. However, if the adaptive item selection depends only on observed responses, the data are missing at random (MAR). We simulated data from three different testing designs (common items, randomly selected items, and CAT) and found that it was possible to re-estimate both person and item parameters from postoperational CAT data. In a multidimensional CAT, we show that it is necessary to include all responses from the testing phase to avoid violating missing data assumptions. We also observed that some CAT designs produced "reversals" where item discriminations became negative, causing dramatic under- and over-estimation of abilities. Our results apply to situations where researchers work with data drawn from adaptive testing or from instructional tools with adaptive delivery. To avoid bias, researchers must make sure they use all the data necessary to meet the MAR assumptions.
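The MAR argument hinges on item selection depending only on observed responses; the toy CAT below (a Rasch bank with grid-based EAP updating, both illustrative choices) makes that dependence explicit:

```python
# Minimal sketch of why adaptive item selection yields MAR data: the next
# item depends only on *observed* responses, here through a grid-based EAP
# ability estimate under a Rasch model.
import numpy as np

rng = np.random.default_rng(5)
bank_b = np.linspace(-2.5, 2.5, 40)              # Rasch difficulties
grid = np.linspace(-4, 4, 121)
prior = np.exp(-0.5 * grid**2)                   # standard normal prior

def rasch_p(theta, b):
    return 1 / (1 + np.exp(-(theta - b)))

def run_cat(true_theta, test_len=12):
    posterior, asked, resp = prior.copy(), [], []
    for _ in range(test_len):
        eap = np.sum(grid * posterior) / posterior.sum()
        # For the Rasch model, maximum information means picking the item
        # whose difficulty is nearest the current estimate. That estimate is
        # a function of observed data only, so the unasked items are MAR.
        remaining = [j for j in range(len(bank_b)) if j not in asked]
        j = min(remaining, key=lambda j: abs(bank_b[j] - eap))
        u = rng.binomial(1, rasch_p(true_theta, bank_b[j]))
        p = rasch_p(grid, bank_b[j])
        posterior *= p if u else (1 - p)          # Bayesian posterior update
        asked.append(j); resp.append(u)
    return asked, resp

asked, resp = run_cat(true_theta=0.7)
print("administered difficulties:", np.round(bank_b[asked], 2))
```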

Citations: 0
Treating Noneffortful Responses as Missing.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-01 (Epub 2024-11-29) · DOI: 10.1177/00131644241297925
Christine E DeMars

This study investigates the treatment of rapid-guess (RG) responses as missing data within the context of the effort-moderated model. Through a series of illustrations, this study demonstrates that the effort-moderated model assumes missing at random (MAR) rather than missing completely at random (MCAR), explaining the conditions necessary for MAR. These examples show that RG responses, when treated as missing under the effort-moderated model, do not introduce bias into ability estimates if the missingness mechanism is properly accounted for. Conversely, using a standard item response theory (IRT) model (scoring RG responses as if they were valid) instead of the effort-moderated model leads to considerable biases, underestimating group means and overestimating standard deviations when the item parameters are known, or overestimating item difficulty if the item parameters are estimated.
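A schematic version of the effort-moderated likelihood (invented parameters and rapid-guess flags; the guessing constant is an assumption) shows why flagged responses behave like missing data for ability estimation:

```python
# Schematic sketch of the effort-moderated likelihood: responses flagged as
# rapid guesses contribute a constant guessing probability, while effortful
# responses follow the 2PL. Flags and parameter values are illustrative.
import numpy as np

def p_2pl(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def effort_moderated_loglik(theta, u, rg_flag, a, b, g=0.25):
    """Log-likelihood of responses u given rapid-guess flags rg_flag.
    Rapid guesses: P(correct) = g regardless of theta, so they carry no
    information about ability; effortful responses: 2PL."""
    p = np.where(rg_flag, g, p_2pl(theta, a, b))
    return np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))

a = np.array([1.2, 0.9, 1.5, 1.0])
b = np.array([-0.5, 0.0, 0.4, 1.1])
u = np.array([1, 1, 0, 1])
rg = np.array([False, False, False, True])   # last response was a rapid guess

grid = np.linspace(-3, 3, 61)
ll = [effort_moderated_loglik(t, u, rg, a, b) for t in grid]
print("ML ability estimate:", grid[int(np.argmax(ll))])
# The rapid-guess term is constant in theta and drops out of the
# maximization; treating RG responses as missing gives the same estimate.
```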

Citations: 0
Optimal Number of Replications for Obtaining Stable Dynamic Fit Index Cutoffs.
IF 2.3 · CAS Tier 3 (Psychology) · Q2 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS · Pub Date: 2025-06-01 (Epub 2024-11-08) · DOI: 10.1177/00131644241290172
Xinran Liu, Daniel McNeish

Factor analysis is commonly used in behavioral sciences to measure latent constructs, and researchers routinely consider approximate fit indices to ensure adequate model fit and to provide important validity evidence. Due to a lack of generalizable fit index cutoffs, methodologists suggest simulation-based methods to create customized cutoffs that allow researchers to assess model fit more accurately. However, simulation-based methods are computationally intensive. An open question is: How many simulation replications are needed for these custom cutoffs to stabilize? This Monte Carlo simulation study focuses on one such simulation-based method, dynamic fit index (DFI) cutoffs, to determine the optimal number of replications for obtaining stable cutoffs. Results indicated that the DFI approach generates stable cutoffs with 500 replications (the currently recommended number), but the process can be more efficient with fewer replications, especially in simulations with categorical data. Using fewer replications significantly reduces the computational time for determining cutoff values with minimal impact on the results. For one-factor or three-factor models, results suggested that in most conditions 200 DFI replications were optimal for balancing fit index cutoff stability and computational efficiency.
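The stability question reduces to Monte Carlo error in an estimated percentile; the stand-in simulation below (a generic fit-index distribution, not an actual SEM simulation) shows how cutoff variability shrinks as replications grow:

```python
# Minimal sketch of the stability question: a DFI-style cutoff is a
# percentile of a simulated fit-index distribution, so its Monte Carlo error
# shrinks as replications grow. The generating distribution is a generic
# stand-in, not an actual SEM fit-index simulation.
import numpy as np

rng = np.random.default_rng(2024)

def cutoff(n_reps):
    # Pretend each replication yields one RMSEA-like value from the
    # misspecified-model distribution; the cutoff is its 5th percentile.
    sim_values = rng.normal(loc=0.06, scale=0.015, size=n_reps)
    return np.percentile(sim_values, 5)

for n_reps in (50, 100, 200, 500):
    draws = [cutoff(n_reps) for _ in range(200)]   # repeat to see variability
    print(f"{n_reps:>4} reps: cutoff SD = {np.std(draws):.4f}")
```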

Citations: 0