Pub Date : 2025-03-24, DOI: 10.1177/01466216251325644
W Holmes Finch, Cihan Demir, Brian F French, Thao Vo
Applied and simulation studies document model convergence and accuracy issues in differential item functioning (DIF) detection with multilevel models, hindering detection. This study aimed to evaluate the effectiveness of various estimation techniques in addressing these issues and ensuring robust DIF detection. We conducted a simulation study to investigate the performance of multilevel logistic regression models with predictors at level 2 across different estimation procedures, including maximum likelihood estimation (MLE), Bayesian estimation, and generalized estimating equations (GEE). The simulation results demonstrated that all three estimators maintained control over the Type I error rate across conditions. In most cases, GEE had comparable or higher power than MLE for identifying DIF, with Bayesian estimation having the lowest power. When potentially important covariates at levels 1 and 2 were included in the model, power was higher for all methods. These results suggest that in many cases where multilevel logistic regression is used for DIF detection, GEE offers a viable option for researchers, and that including important contextual variables at all levels of the data is desirable. Implications for practice are discussed.
{"title":"Accuracy in Invariance Detection With Multilevel Models With Three Estimators.","authors":"W Holmes Finch, Cihan Demir, Brian F French, Thao Vo","doi":"10.1177/01466216251325644","DOIUrl":"10.1177/01466216251325644","url":null,"abstract":"<p><p>Applied and simulation studies document model convergence and accuracy issues in differential item functioning detection with multilevel models, hindering detection. This study aimed to evaluate the effectiveness of various estimation techniques in addressing these issues and ensure robust DIF detection. We conducted a simulation study to investigate the performance of multilevel logistic regression models with predictors at level 2 across different estimation procedures, including maximum likelihood estimation (MLE), Bayesian estimation, and generalized estimating equations (GEE). The simulation results demonstrated that all maintained control over the Type I error rate across conditions. In most cases, GEE had comparable or higher power compared to MLE for identifying DIF, with Bayes having the lowest power. When potentially important covariates at levels-1 and 2 were included in the model, power for all methods was higher. These results suggest that in many cases where multilevel logistic regression is used for DIF detection, GEE offers a viable option for researchers and that including important contextual variables at all levels of the data is desirable. Implications for practice are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251325644"},"PeriodicalIF":1.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11948245/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143755115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-24, DOI: 10.1177/01466216251330305
Marie Wiberg, Inga Laukaityte
Test score equating is used to make scores from different test forms comparable, even when groups differ in ability. In practice, the non-equivalent groups with anchor test (NEAT) design is commonly used. The overall aim was to compare the amount of bias under different conditions when using either chained equating or frequency estimation with five different criterion functions: the identity function, linear equating, equipercentile equating, chained equating, and frequency estimation. We used real test data from a multiple-choice, binary-scored college admissions test to illustrate that the choice of criterion function matters. Further, we simulated data in line with the empirical data to examine differences in ability between groups, in item difficulty, in anchor and regular test form length, in correlations between the anchor and regular test forms, and in sample size. The results indicate that how bias is defined heavily affects the conclusions drawn about which equating method is to be preferred in different scenarios. Practical implications for standardized tests are given, together with recommendations on how to calculate bias when evaluating equating transformations.
{"title":"Calculating Bias in Test Score Equating in a NEAT Design.","authors":"Marie Wiberg, Inga Laukaityte","doi":"10.1177/01466216251330305","DOIUrl":"10.1177/01466216251330305","url":null,"abstract":"<p><p>Test score equating is used to make scores from different test forms comparable, even when groups differ in ability. In practice, the non-equivalent group with anchor test (NEAT) design is commonly used. The overall aim was to compare the amount of bias under different conditions when using either chained equating or frequency estimation with five different criterion functions: the identity function, linear equating, equipercentile, chained equating and frequency estimation. We used real test data from a multiple-choice binary scored college admissions test to illustrate that the choice of criterion function matter. Further, we simulated data in line with the empirical data to examine difference in ability between groups, difference in item difficulty, difference in anchor test form and regular test form length, difference in correlations between anchor test form and regular test forms, and different sample size. The results indicate that how bias is defined heavily affects the conclusions we draw about which equating method is to be preferred in different scenarios. Practical implications of this in standardized tests are given together with recommendations on how to calculate bias when evaluating equating transformations.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251330305"},"PeriodicalIF":1.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11948241/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143755122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-11, DOI: 10.1177/01466216251324938
Lawrence T DeCarlo
The MC-DINA model is a cognitive diagnosis model (CDM) for multiple-choice items that was introduced by de la Torre (2009). The model extends the usual CDM in two basic ways: it allows for nominal responses instead of only dichotomous responses, and it allows skills to affect not only the choice of the correct response but also the choice of distractors. Here it is shown that the model can be re-expressed as a multinomial logit model with latent discrete predictors, that is, as a multinomial mixture model; a signal detection-like parameterization is also used. The reparameterization clarifies details about the structure and assumptions of the model, especially with respect to distractors, and helps to reveal parameter restrictions, which in turn have implications for psychological interpretations of the data and for issues with respect to statistical estimation. The approach suggests parsimonious models that are useful for practical applications, particularly for small sample sizes. The restrictions are shown to appear for items from the TIMSS 2007 fourth grade exam.
{"title":"On a Reparameterization of the MC-DINA Model.","authors":"Lawrence T DeCarlo","doi":"10.1177/01466216251324938","DOIUrl":"10.1177/01466216251324938","url":null,"abstract":"<p><p>The MC-DINA model is a cognitive diagnosis model (CDM) for multiple-choice items that was introduced by de la Torre (2009). The model extends the usual CDM in two basic ways: it allows for nominal responses instead of only dichotomous responses, and it allows skills to affect not only the choice of the correct response but also the choice of distractors. Here it is shown that the model can be re-expressed as a multinomial logit model with latent discrete predictors, that is, as a multinomial mixture model; a signal detection-like parameterization is also used. The reparameterization clarifies details about the structure and assumptions of the model, especially with respect to distractors, and helps to reveal parameter restrictions, which in turn have implications for psychological interpretations of the data and for issues with respect to statistical estimation. The approach suggests parsimonious models that are useful for practical applications, particularly for small sample sizes. The restrictions are shown to appear for items from the TIMSS 2007 fourth grade exam.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251324938"},"PeriodicalIF":1.0,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897991/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143626591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-02, DOI: 10.1177/01466216251322285
Jesper Tijmstra, Maria Bolsinova
When using Likert scales, the inclusion of a middle-category response option poses a challenge for the valid measurement of the psychological attribute of interest. While this middle category is often included to provide respondents with a neutral response option, respondents may in practice also select this category when they do not want to or cannot give an informative response. If one analyzes the response data without considering these two possible uses of the middle response category, measurement may be confounded. In this paper, we propose a response-mixture IRTree model for the analysis of Likert-scale data. This model acknowledges that the middle response category can either be selected as a non-response option (and hence be uninformative for the attribute of interest) or to communicate a neutral position (and hence be informative), and that this choice depends on both person- and item-characteristics. For each observed middle-category response, the probability that it was intended to be informative is modeled, and both the attribute of substantive interest and a non-response tendency are estimated. The performance of the model is evaluated in a simulation study, and the procedure is applied to empirical data from personality psychology.
{"title":"Modeling Within- and Between-Person Differences in the Use of the Middle Category in Likert Scales.","authors":"Jesper Tijmstra, Maria Bolsinova","doi":"10.1177/01466216251322285","DOIUrl":"10.1177/01466216251322285","url":null,"abstract":"<p><p>When using Likert scales, the inclusion of a middle-category response option poses a challenge for the valid measurement of the psychological attribute of interest. While this middle category is often included to provide respondents with a neutral response option, respondents may in practice also select this category when they do not want to or cannot give an informative response. If one analyzes the response data without considering these two possible uses of the middle response category, measurement may be confounded. In this paper, we propose a response-mixture IRTree model for the analysis of Likert-scale data. This model acknowledges that the middle response category can either be selected as a non-response option (and hence be uninformative for the attribute of interest) or to communicate a neutral position (and hence be informative), and that this choice depends on both person- and item-characteristics. For each observed middle-category response, the probability that it was intended to be informative is modeled, and both the attribute of substantive interest and a non-response tendency are estimated. The performance of the model is evaluated in a simulation study, and the procedure is applied to empirical data from personality psychology.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251322285"},"PeriodicalIF":1.0,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11873858/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-01, DOI: 10.1177/01466216251322353
Nicholas Trout, Kylie Gorney
Romero et al. (2015; see also Wollack, 1997) developed the ω statistic as a method for detecting unusually similar answers between pairs of examinees. For each pair, the ω statistic considers whether the observed number of similar answers is significantly larger than the expected number of similar answers. However, one limitation of ω is that it does not account for the particular items on which similar answers are observed. Therefore, in this study, we propose a weighted version of the ω statistic that takes this information into account. We compare the performance of the new and existing statistics using detailed simulations in which several factors are manipulated. Results show that while both the new and existing statistics are able to control the Type I error rate, the new statistic is more powerful, on average.
{"title":"Weighted Answer Similarity Analysis.","authors":"Nicholas Trout, Kylie Gorney","doi":"10.1177/01466216251322353","DOIUrl":"10.1177/01466216251322353","url":null,"abstract":"<p><p>Romero et al. (2015; see also Wollack, 1997) developed the <i>ω</i> statistic as a method for detecting unusually similar answers between pairs of examinees. For each pair, the <i>ω</i> statistic considers whether the observed number of similar answers is significantly larger than the expected number of similar answers. However, one limitation of <i>ω</i> is that it does not account for the particular items on which similar answers are observed. Therefore, in this study, we propose a weighted version of the <i>ω</i> statistic that takes this information into account. We compare the performance of the new and existing statistics using detailed simulations in which several factors are manipulated. Results show that while both the new and existing statistics are able to control the Type I error rate, the new statistic is more powerful, on average.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251322353"},"PeriodicalIF":1.0,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11873304/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-01, Epub Date: 2024-10-15, DOI: 10.1177/01466216241291233
Jonas Bjermo
The design of an achievement test is crucial for many reasons. This article focuses on a population's ability growth between school grades. We define design as the allocation of test items with respect to their difficulties. The objective is to present an optimal test design method for estimating mean and percentile ability growth with good precision. We use the asymptotic expression of the variance in terms of the test information. With that criterion for optimization, we propose particle swarm optimization to find the optimal design. The results show that the allocation of the item difficulties depends on item discrimination and the magnitude of the ability growth. The optimization function depends on the examinees' abilities and hence on the value of the unknown mean ability growth. Therefore, we also use an optimum-in-average design and conclude that it is robust to uncertainty in the mean ability growth. In practice, a test is assembled from items stored in an item pool with calibrated item parameters. Hence, we also perform a discrete optimization using simulated annealing and compare the results to those from particle swarm optimization.
{"title":"Optimal Test Design for Estimation of Mean Ability Growth.","authors":"Jonas Bjermo","doi":"10.1177/01466216241291233","DOIUrl":"10.1177/01466216241291233","url":null,"abstract":"<p><p>The design of an achievement test is crucial for many reasons. This article focuses on a population's ability growth between school grades. We define design as the allocating of test items concerning the difficulties. The objective is to present an optimal test design method for estimating the mean and percentile ability growth with good precision. We use the asymptotic expression of the variance in terms of the test information. With that criterion for optimization, we propose to use particle swarm optimization to find the optimal design. The results show that the allocation of the item difficulties depends on item discrimination and the magnitude of the ability growth. The optimization function depends on the examinees' abilities, hence, the value of the unknown mean ability growth. Therefore, we will also use an optimum in-average design and conclude that it is robust to uncertainty in the mean ability growth. A test is, in practice, assembled from items stored in an item pool with calibrated item parameters. Hence, we also perform a discrete optimization using simulated annealing and compare the results to the particle swarm optimization.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"29-49"},"PeriodicalIF":1.2,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11560061/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142630381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-01, Epub Date: 2024-10-10, DOI: 10.1177/01466216241284418
Hans-Friedrich Köhn, Chia-Yi Chiu, Olasumbo Oluwalana, Hyunjoo Kim, Jiaxi Wang
Cognitive diagnosis models in educational measurement are restricted latent class models that describe ability in a knowledge domain as a composite of latent skills that an examinee may or may not have mastered. Different combinations of skills define distinct latent proficiency classes to which examinees are assigned based on test performance. Items of cognitively diagnostic assessments are characterized by skill profiles specifying which skills are required for a correct item response. The item-skill profiles of a test form its Q-matrix. The validity of cognitive diagnosis depends crucially on the correct specification of the Q-matrix. Typically, Q-matrices are determined by curricular experts. However, expert judgment is fallible. Data-driven estimation methods have been developed with the promise of greater accuracy in identifying the Q-matrix of a test. Yet, many of the extant methods encounter computational feasibility issues, either in the form of excessive CPU time or inadmissible estimates. In this article, a two-step algorithm for estimating the Q-matrix is proposed that can be used with any cognitive diagnosis model. Simulations showed that the new method outperformed extant estimation algorithms and was computationally more efficient. It was also applied to Tatsuoka's well-known fraction-subtraction data. The paper concludes with a discussion of theoretical and practical implications of the findings.
{"title":"A Two-Step Q-Matrix Estimation Method.","authors":"Hans-Friedrich Köhn, Chia-Yi Chiu, Olasumbo Oluwalana, Hyunjoo Kim, Jiaxi Wang","doi":"10.1177/01466216241284418","DOIUrl":"10.1177/01466216241284418","url":null,"abstract":"<p><p>Cognitive Diagnosis Models in educational measurement are restricted latent class models that describe ability in a knowledge domain as a composite of latent skills an examinee may have mastered or failed. Different combinations of skills define distinct latent proficiency classes to which examinees are assigned based on test performance. Items of cognitively diagnostic assessments are characterized by skill profiles specifying which skills are required for a correct item response. The item-skill profiles of a test form its Q-matrix. The validity of cognitive diagnosis depends crucially on the correct specification of the Q-matrix. Typically, Q-matrices are determined by curricular experts. However, expert judgment is fallible. Data-driven estimation methods have been developed with the promise of greater accuracy in identifying the Q-matrix of a test. Yet, many of the extant methods encounter computational feasibility issues either in the form of excessive amounts of CPU times or inadmissible estimates. In this article, a two-step algorithm for estimating the Q-matrix is proposed that can be used with any cognitive diagnosis model. Simulations showed that the new method outperformed extant estimation algorithms and was computationally more efficient. It was also applied to Tatsuoka's famous fraction-subtraction data. The paper concludes with a discussion of theoretical and practical implications of the findings.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"3-28"},"PeriodicalIF":1.2,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11560062/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142630379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-01, Epub Date: 2024-10-21, DOI: 10.1177/01466216241291237
Laixu Shang, Ping-Feng Xu, Na Shan, Man-Lai Tang, Qian-Zhen Zheng
One of the main concerns in multidimensional item response theory (MIRT) is detecting the relationship between items and latent traits, which can be treated as a latent variable selection problem. An attractive method for latent variable selection in the multidimensional 2-parameter logistic (M2PL) model is to minimize the observed Bayesian information criterion (BIC) using the expectation model selection (EMS) algorithm. The EMS algorithm extends the EM algorithm and allows the model (e.g., the loading structure in MIRT) to be updated in the iterations along with the parameters under the model. As an extension of the M2PL model, the multidimensional 3-parameter logistic (M3PL) model introduces an additional guessing parameter, which makes latent variable selection more challenging. In this paper, a well-designed EMS algorithm, named improved EMS (IEMS), is proposed to accurately and efficiently detect the underlying true loading structure in the M3PL model; it also works for the M2PL model. In simulation studies, we compare the IEMS algorithm with several state-of-the-art methods, and IEMS is competitive in terms of model recovery, estimation precision, and computational efficiency. The IEMS algorithm is illustrated by its application to two real data sets.
{"title":"The Improved EMS Algorithm for Latent Variable Selection in M3PL Model.","authors":"Laixu Shang, Ping-Feng Xu, Na Shan, Man-Lai Tang, Qian-Zhen Zheng","doi":"10.1177/01466216241291237","DOIUrl":"10.1177/01466216241291237","url":null,"abstract":"<p><p>One of the main concerns in multidimensional item response theory (MIRT) is to detect the relationship between items and latent traits, which can be treated as a latent variable selection problem. An attractive method for latent variable selection in multidimensional 2-parameter logistic (M2PL) model is to minimize the observed Bayesian information criterion (BIC) by the expectation model selection (EMS) algorithm. The EMS algorithm extends the EM algorithm and allows the updates of the model (e.g., the loading structure in MIRT) in the iterations along with the parameters under the model. As an extension of the M2PL model, the multidimensional 3-parameter logistic (M3PL) model introduces an additional guessing parameter which makes the latent variable selection more challenging. In this paper, a well-designed EMS algorithm, named improved EMS (IEMS), is proposed to accurately and efficiently detect the underlying true loading structure in the M3PL model, which also works for the M2PL model. In simulation studies, we compare the IEMS algorithm with several state-of-art methods and the IEMS is of competitiveness in terms of model recovery, estimation precision, and computational efficiency. The IEMS algorithm is illustrated by its application to two real data sets.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"50-70"},"PeriodicalIF":1.2,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11559968/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142630392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-26, DOI: 10.1177/01466216251322646
Maryam Pezeshki, Susan Embretson
To maintain test quality, a large supply of items is typically desired. Automatic item generation can reduce cost and labor, especially if the generated items have predictable item parameters, possibly reducing or eliminating the need for empirical tryout. However, the effect of different levels of item parameter predictability on the accuracy of trait estimation using item response theory models is unclear. If predictability is lower, adding response time as a collateral source of information may mitigate the effect on trait estimation accuracy. The present study investigates the impact of varying item parameter predictability on trait estimation accuracy, along with the impact of adding response time as a collateral source of information. Results indicated that trait estimation accuracy using item family model-based item parameters differed only slightly from using known item parameters. Somewhat larger trait estimation errors resulted from using cognitive complexity features to predict item parameters. Further, adding response times to the model resulted in more accurate trait estimation for tests with lower item difficulty levels (e.g., achievement tests). Implications for item generation and for the response processes aspect of validity are discussed.
{"title":"Impact of Parameter Predictability and Joint Modeling of Response Accuracy and Response Time on Ability Estimates.","authors":"Maryam Pezeshki, Susan Embretson","doi":"10.1177/01466216251322646","DOIUrl":"https://doi.org/10.1177/01466216251322646","url":null,"abstract":"<p><p>To maintain test quality, a large supply of items is typically desired. Automatic item generation can result in a reduction in cost and labor, especially if the generated items have predictable item parameters and thus possibly reducing or eliminating the need for empirical tryout. However, the effect of different levels of item parameter predictability on the accuracy of trait estimation using item response theory models is unclear. If predictability is lower, adding response time as a collateral source of information may mitigate the effect on trait estimation accuracy. The present study investigates the impact of varying item parameter predictability on trait estimation accuracy, along with the impact of adding response time as a collateral source of information. Results indicated that trait estimation accuracy using item family model-based item parameters differed only slightly from using known item parameters. Somewhat larger trait estimation errors resulted from using cognitive complexity features to predict item parameters. Further, adding response times to the model resulted in more accurate trait estimation for tests with lower item difficulty levels (e.g., achievement tests). Implications for item generation and response processes aspect of validity are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251322646"},"PeriodicalIF":1.0,"publicationDate":"2025-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11866334/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143543104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-20, DOI: 10.1177/01466216251320403
Nate R Smith, Lisa A Keller, Richard A Feinberg, Chunyan Liu
Item preknowledge refers to the case where examinees have advance knowledge of test material prior to taking the examination. When examinees have item preknowledge, the scores that result from those item responses are not true reflections of the examinee's proficiency. Further, this contamination also affects the item parameter estimates and therefore the scores of all examinees, regardless of whether they had prior knowledge. To ensure the validity of test scores, it is essential to identify both compromised items (CIs) and examinees with preknowledge (EWPs). In some cases, the CIs are known, and the task is reduced to determining the EWPs. However, given the potential threat to validity, it is critical for high-stakes testing programs to have a process for routinely monitoring for evidence of EWPs, often when CIs are unknown. Further, even knowing that specific items may have been compromised does not guarantee that any examinees had prior access to those items, or that those examinees who did have prior access know how to use the preknowledge effectively. Therefore, this paper attempts to use response behavior to identify item preknowledge without knowledge of which items may or may not have been compromised. While most research in this area has relied on traditional psychometric models, we investigate the utility of an unsupervised machine learning algorithm, the extended isolation forest (EIF), to detect EWPs. As in previous research, the response behavior being analyzed consists of response time (RT) and response accuracy (RA).
{"title":"Few and Different: Detecting Examinees With Preknowledge Using Extended Isolation Forests.","authors":"Nate R Smith, Lisa A Keller, Richard A Feinberg, Chunyan Liu","doi":"10.1177/01466216251320403","DOIUrl":"10.1177/01466216251320403","url":null,"abstract":"<p><p>Item preknowledge refers to the case where examinees have advanced knowledge of test material prior to taking the examination. When examinees have item preknowledge, the scores that result from those item responses are not true reflections of the examinee's proficiency. Further, this contamination in the data also has an impact on the item parameter estimates and therefore has an impact on scores for all examinees, regardless of whether they had prior knowledge. To ensure the validity of test scores, it is essential to identify both issues: compromised items (CIs) and examinees with preknowledge (EWPs). In some cases, the CIs are known, and the task is reduced to determining the EWPs. However, due to the potential threat to validity, it is critical for high-stakes testing programs to have a process for routinely monitoring for evidence of EWPs, often when CIs are unknown. Further, even knowing that specific items may have been compromised does not guarantee that any examinees had prior access to those items, or that those examinees that did have prior access know how to effectively use the preknowledge. Therefore, this paper attempts to use response behavior to identify item preknowledge without knowledge of which items may or may not have been compromised. While most research in this area has relied on traditional psychometric models, we investigate the utility of an unsupervised machine learning algorithm, extended isolation forest (EIF), to detect EWPs. Similar to previous research, the response behavior being analyzed is response time (RT) and response accuracy (RA).</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251320403"},"PeriodicalIF":1.0,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11843570/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143484553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}