Optimal Item Calibration in the Context of the Swedish Scholastic Aptitude Test.
Pub Date : 2026-02-06 | DOI: 10.1177/01466216261420758
Jonas Bjermo, Ellinor Fackle Fornius, Frank Miller
Large-scale achievement tests require item banks with items for use in future tests. Before an item is included in the bank, its characteristics need to be estimated; this process is called item calibration. For the quality of future achievement tests, it is important to perform this calibration well, and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to the examinees whose abilities suit them best. Theoretical evidence shows advantages of ability-dependent allocation of pretest items, but it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see improved calibration precision. While this average improvement is moderate, we are able to identify the kinds of items for which the method works well, which enables targeting specific item types for optimal calibration. We also discuss possible improvements to the method.
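To give a concrete sense of what ability-dependent allocation means, here is a minimal sketch (not the authors' algorithm; the helper names, the greedy rule, and the capacity constraint are illustrative assumptions): under a 2PL model, an examinee at ability θ contributes Fisher information of roughly a²p(θ)(1−p(θ)) to estimating an item's difficulty, which peaks when θ is near the item's difficulty b, so a simple allocator routes each examinee to the pretest item where that contribution is largest.

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def allocate(thetas, items, capacity):
    """Greedily assign each examinee the pretest item for which their
    ability contributes the most Fisher information, subject to a
    per-item capacity. `items` holds provisional (a, b) guesses."""
    remaining = [capacity] * len(items)
    assignment = []
    for theta in thetas:
        # Information about the difficulty parameter at ability theta:
        # a^2 * p * (1 - p), largest when theta is near b.
        info = [a**2 * p2pl(theta, a, b) * (1 - p2pl(theta, a, b))
                if remaining[j] > 0 else -np.inf
                for j, (a, b) in enumerate(items)]
        j = int(np.argmax(info))
        remaining[j] -= 1
        assignment.append(j)
    return assignment

rng = np.random.default_rng(1)
thetas = rng.normal(size=300)                    # examinee abilities
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.2)]    # provisional (a, b)
print(allocate(thetas, items, capacity=100)[:10])
```

The paper optimizes a formal optimal-design criterion over the whole calibration sample; this greedy rule only conveys the intuition that examinees should be matched to items near their ability.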
{"title":"Optimal Item Calibration in the Context of the Swedish Scholastic Aptitude Test.","authors":"Jonas Bjermo, Ellinor Fackle Fornius, Frank Miller","doi":"10.1177/01466216261420758","DOIUrl":"https://doi.org/10.1177/01466216261420758","url":null,"abstract":"<p><p>Large-scale achievement tests require the existence of item banks with items for use in future tests. Before an item is included into the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of the future achievement tests, it is important to perform this calibration well and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suited ability. Theoretical evidence shows advantages with using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results hold also in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify for what kind of items the method works well. This enables targeting specific item types for optimal calibration. We also discuss possibilities for improvements of the method.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261420758"},"PeriodicalIF":1.2,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12880929/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146144195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Influence of Uninformative Prior Distributions for MCMC Method on Estimating Variance Components in Generalizability Theory.
Pub Date : 2026-02-03 | DOI: 10.1177/01466216261415631
Guangming Li
<p><p>The Markov chain Monte Carlo (MCMC) method is more and more widely used to estimate variance components in generalizability theory (GT). However, as an essential part of MCMC method, uninformative priors haven't been explored and different GT researches vary in the use of uninformative priors. This study focused on effect of the different uninformative priors on estimating variance components. Based on <i>p × i × r</i> design, eight uninformative prior distributions were chosen for simulation study and empirical study, including <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 1], <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 2], <math> <mrow> <msup><mrow><mo> </mo> <mi>σ</mi></mrow> <mn>2</mn></msup> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>1000</mn></mrow> <mo>)</mo></mrow> </mrow> </math> <b>[</b>prior 3<b>]</b>, <math><mrow><mi>σ</mi> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>100</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 4], <math><mrow><mi>log</mi> <mo></mo> <mrow><mo>(</mo> <msup><mi>σ</mi> <mn>2</mn></msup> <mo>)</mo></mrow> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mo>-</mo> <mn>10</mn> <mo>,</mo> <mn>10</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 5], <math> <mrow><mfrac><mn>1</mn> <msup><mi>σ</mi> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>p</mi> <mi>a</mi> <mi>r</mi> <mi>e</mi> <mi>t</mi> <mi>o</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> <mo> </mo> <mrow><mo>[</mo> <mrow><mtext>prior</mtext> <mo> </mo> <mn>6</mn></mrow> <mo>]</mo></mrow> </mrow> </math> , <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mo>(</mo> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>+</mo> <msup><mi>τ</mi> <mn>2</mn></msup> </mrow> <mo>)</mo></mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi></mrow> </math> [prior 7], and <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mn>2</mn> <mi>τ</mi> <mrow><mo>(</mo> <mrow><mi>σ</mi> <mo>+</mo> <mi>τ</mi></mrow> <mo>)</mo></mrow> </mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 8
马尔可夫链蒙特卡罗(MCMC)方法是泛化理论中越来越广泛使用的方差分量估计方法。然而,作为MCMC方法的一个重要组成部分,非信息先验尚未得到研究,不同的GT研究对非信息先验的使用也各不相同。本文主要研究了不同的非信息先验对方差成分估计的影响。基于p××r设计、八不提供信息的先验分布为仿真研究和实证研究,选择包括σ2∼我n v - g m m(0.001, 0.001)[1]之前,σ2∼我n v - g m m(1, 1)[2]之前,σ2∼u n i f o r m(0.001, 1000)[3]之前,σ∼u n i f o r m(0, 100)[4]之前,日志(σ2)∼u n i f o r m(- 10、10)[5]之前,1σ2∼p r e t o(0.001)之前[6],σ2(σ2 +τ2)2∼u n i f o r m[7]之前,和σ 22 τ (σ + τ) 2 ~ u n I f or m(0,1)[先验8]。并计算了完整数据和10%缺失/稀疏数据的三个后验点估计(即均值、中位数和众数)。经过仿真研究和实证研究,结果表明:(1)σ 2 ~ in v - g a m ma (0.001, 0.001) [prior 1]在大多数情况下的后验点估计性能最好且更稳定,而1 σ 2 ~ p ar o (1,0.001) [prior 6]总是最差的后验点估计;(2)不同方法的差异主要体现在方差分量σ i 2和σ r 2上,先验6存在明显的极值偏差,极值偏差最大可达281.09和167.59;(3)后验均值估计总是产生最大的偏差,但后验中值估计是最好的;(4)当方差分量的水平数较小时,无信息先验间方差分量的估计差异较大;(5)完整数据与10%缺失/稀疏数据的结果基本相同。少量的缺失/稀疏数据对结果的影响很小。这8个发行版的运行时间从489.78秒到692.58秒不等,彼此之间差别不大。
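To illustrate how such an uninformative prior enters the estimation, here is a minimal Gibbs-sampler sketch for a simplified person-by-item design under prior 1 (the rater facet of the p × i × r design is omitted, and the sampler, sample sizes, and variable names are illustrative assumptions, not the authors' code). The inverse-gamma prior is conjugate, so each variance component has a closed-form conditional.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate a simple person-by-item (p x i) table.
n_p, n_i = 100, 20
sigma_p2, sigma_e2 = 1.0, 0.5
p_true = rng.normal(0, np.sqrt(sigma_p2), n_p)
y = p_true[:, None] + rng.normal(0, np.sqrt(sigma_e2), (n_p, n_i))

a0, b0 = 0.001, 0.001          # inv-gamma(0.001, 0.001), "prior 1"
mu, p = 0.0, np.zeros(n_p)
s_p2, s_e2 = 1.0, 1.0
draws = []
for it in range(2000):
    # Person effects: normal full conditional.
    prec = n_i / s_e2 + 1.0 / s_p2
    mean = ((y - mu).sum(axis=1) / s_e2) / prec
    p = rng.normal(mean, np.sqrt(1.0 / prec))
    # Grand mean: flat prior gives a normal conditional.
    resid = y - p[:, None]
    mu = rng.normal(resid.mean(), np.sqrt(s_e2 / y.size))
    # Variance components: conjugate inverse-gamma updates
    # (if X ~ Gamma(shape a, scale 1/b), then 1/X ~ inv-gamma(a, b)).
    s_p2 = 1.0 / rng.gamma(a0 + n_p / 2, 1.0 / (b0 + (p**2).sum() / 2))
    err = y - mu - p[:, None]
    s_e2 = 1.0 / rng.gamma(a0 + y.size / 2, 1.0 / (b0 + (err**2).sum() / 2))
    if it >= 500:                              # discard burn-in
        draws.append((s_p2, s_e2))

draws = np.array(draws)
print("posterior means  :", draws.mean(axis=0))      # ~ (1.0, 0.5)
print("posterior medians:", np.median(draws, axis=0))
```

The final two lines correspond to two of the three posterior point estimates the abstract compares; swapping in a different prior changes only the two variance-component update lines.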
{"title":"Influence of Uninformative Prior Distributions for MCMC Method on Estimating Variance Components in Generalizability Theory.","authors":"Guangming Li","doi":"10.1177/01466216261415631","DOIUrl":"10.1177/01466216261415631","url":null,"abstract":"<p><p>The Markov chain Monte Carlo (MCMC) method is more and more widely used to estimate variance components in generalizability theory (GT). However, as an essential part of MCMC method, uninformative priors haven't been explored and different GT researches vary in the use of uninformative priors. This study focused on effect of the different uninformative priors on estimating variance components. Based on <i>p × i × r</i> design, eight uninformative prior distributions were chosen for simulation study and empirical study, including <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 1], <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 2], <math> <mrow> <msup><mrow><mo> </mo> <mi>σ</mi></mrow> <mn>2</mn></msup> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>1000</mn></mrow> <mo>)</mo></mrow> </mrow> </math> <b>[</b>prior 3<b>]</b>, <math><mrow><mi>σ</mi> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>100</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 4], <math><mrow><mi>log</mi> <mo></mo> <mrow><mo>(</mo> <msup><mi>σ</mi> <mn>2</mn></msup> <mo>)</mo></mrow> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mo>-</mo> <mn>10</mn> <mo>,</mo> <mn>10</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 5], <math> <mrow><mfrac><mn>1</mn> <msup><mi>σ</mi> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>p</mi> <mi>a</mi> <mi>r</mi> <mi>e</mi> <mi>t</mi> <mi>o</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> <mo> </mo> <mrow><mo>[</mo> <mrow><mtext>prior</mtext> <mo> </mo> <mn>6</mn></mrow> <mo>]</mo></mrow> </mrow> </math> , <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mo>(</mo> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>+</mo> <msup><mi>τ</mi> <mn>2</mn></msup> </mrow> <mo>)</mo></mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi></mrow> </math> [prior 7], and <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mn>2</mn> <mi>τ</mi> <mrow><mo>(</mo> <mrow><mi>σ</mi> <mo>+</mo> <mi>τ</mi></mrow> <mo>)</mo></mrow> </mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 8","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261415631"},"PeriodicalIF":1.2,"publicationDate":"2026-02-03","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867738/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146126804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Estimating and Fitting the Non-continuous category scored Polytomous Items under the Weighted Score Logistic Model and its Simulation Study.
Pub Date : 2026-01-28 | DOI: 10.1177/01466216261420305
Xiaozhu Jian, Buyun Dai, Yeqi Qing, YuanPing Deng
This study presents a novel extension of the weighted score logistic model (WSLM). The WSLM is an advancement of the traditional dichotomous logistic model that incorporates an additional weighted score parameter. This model is specifically designed to analyze non-continuous category scored polytomous items in educational and psychological testing contexts. Within the WSLM framework, the mean difficulty parameter reflects the overall item difficulty, while both discrimination and mean difficulty parameters are estimated using marginal maximum likelihood estimation. A Monte Carlo simulation study was conducted to evaluate the performance of the WSLM, which demonstrated low levels of bias and root mean square error (RMSE) of item parameters, indicative of accurate parameter recovery. Under most simulation conditions, the fit statistics Q1 and Q4 for polytomous items under the WSLM remained below their respective critical chi-square values, suggesting acceptable model-data fit. These results support the applicability and robustness of the WSLM in practical assessment settings involving complex scoring schemes.
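The paper's Q1 and Q4 statistics for polytomous WSLM items are not reproduced here, but the logic of a Q1-type fit check can be sketched for the dichotomous case (an illustration following Yen's Q1; the decile grouping, helper names, and toy data are assumptions): group examinees by estimated ability and accumulate a chi-square contrast of observed versus model-implied proportions.

```python
import numpy as np

def q1_statistic(theta_hat, responses, prob_fn, n_groups=10):
    """Q1-type item fit: partition examinees into ability deciles and
    sum a chi-square contrast of observed vs. expected p-correct.
    `prob_fn(theta)` returns the model-implied success probability."""
    edges = np.quantile(theta_hat, np.linspace(0, 1, n_groups + 1))
    g = np.clip(np.searchsorted(edges, theta_hat, side="right") - 1,
                0, n_groups - 1)
    q1 = 0.0
    for k in range(n_groups):
        mask = g == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        obs = responses[mask].mean()
        exp = prob_fn(theta_hat[mask]).mean()
        q1 += n_k * (obs - exp) ** 2 / (exp * (1 - exp))
    return q1

# Toy check with a correctly specified 2PL item (a = 1, b = 0):
# the statistic should stay near its chi-square critical value.
rng = np.random.default_rng(3)
theta = rng.normal(size=5000)
x = rng.binomial(1, 1 / (1 + np.exp(-theta)))
print(q1_statistic(theta, x, lambda t: 1 / (1 + np.exp(-t))))
```

A well-fitting item keeps this sum below the relevant chi-square critical value, which is the comparison the abstract reports for the WSLM's polytomous extension.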
{"title":"Estimating and Fitting the Non-continuous category scored Polytomous Items under the Weighted Score Logistic Model and its Simulation Study.","authors":"Xiaozhu Jian, Buyun Dai, Yeqi Qing, YuanPing Deng","doi":"10.1177/01466216261420305","DOIUrl":"https://doi.org/10.1177/01466216261420305","url":null,"abstract":"<p><p>This study presents a novel extension of the weighted score logistic model (WSLM). The WSLM is an advancement of the traditional dichotomous logistic model that incorporates an additional weighted score parameter. This model is specifically designed to analyze non-continuous category scored polytomous items in educational and psychological testing contexts. Within the WSLM framework, the mean difficulty parameter reflects the overall item difficulty, while both discrimination and mean difficulty parameters are estimated using marginal maximum likelihood estimation. A Monte Carlo simulation study was conducted to evaluate the performance of the WSLM, which demonstrated low levels of bias and root mean square error (RMSE) of item parameters, indicative of accurate parameter recovery. Under most simulation conditions, the fit statistics Q1 and Q4 for polytomous items under the WSLM remained below their respective critical chi-square values, suggesting acceptable model-data fit. These results support the applicability and robustness of the WSLM in practical assessment settings involving complex scoring schemes.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261420305"},"PeriodicalIF":1.2,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12854999/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalized Cohen's d for Multiple Means and Polytomous Settings.
Pub Date : 2026-01-20 | DOI: 10.1177/01466216261416025
Jari Metsämuuronen
Cohen's d is the most commonly used estimator to quantify the magnitude of the difference between the means of two subpopulations. When comparing multiple populations simultaneously, Cohen's f can be used for the same purpose. Using their relationship in the dichotomous setting, several general formulas for d are derived that generalize d to the polytomous setting. The traditional simplified estimator d = 2f is studied as a shortcut estimator. It is strongly recommended to use the general formulas instead of the simplified ones when assessing the magnitude of the effect size, especially when the discrepancy of the extreme proportions of cases in the subpopulations exceeds 0.40.
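The shortcut d = 2f follows directly from the standard definitions in the two-group case with equal group sizes:

\[
d = \frac{\mu_1 - \mu_2}{\sigma}, \qquad
f = \frac{\sigma_m}{\sigma}, \quad
\sigma_m = \sqrt{\tfrac{1}{2}\left[(\mu_1-\bar{\mu})^2 + (\mu_2-\bar{\mu})^2\right]}
         = \frac{|\mu_1 - \mu_2|}{2},
\]

so \(f = |d|/2\), i.e., \(d = 2f\). With more than two means, or with unequal proportions of cases across subpopulations, \(\sigma_m\) no longer reduces to \(|\mu_1-\mu_2|/2\), which is why the general formulas derived in the article are preferred over the shortcut in those settings.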
{"title":"Generalized Cohen's d for Multiple Means and Polytomous Settings.","authors":"Jari Metsämuuronen","doi":"10.1177/01466216261416025","DOIUrl":"10.1177/01466216261416025","url":null,"abstract":"<p><p>Cohen's <i>d</i> is the most commonly used estimator to quantify the magnitude of the difference between the means of two subpopulations. When comparing multiple populations simultaneously, Cohen's <i>f</i> can be used for the same purpose. Using their relationship in the dichotomous setting, several general formulas for <i>d</i> are derived that generalize <i>d</i> to the polytomous setting. The traditional simplified estimator <i>d</i> = 2<i>f</i> is studied as a shortcut estimator. It is strongly recommended to use the general formulas instead of the simplified ones when assessing the magnitude of the effect size, especially when the discrepancy of the extreme proportions of cases in the subpopulations exceeds 0.40.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261416025"},"PeriodicalIF":1.2,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12819128/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Calibrating Multidimensional Assessments With Structural Missingness: An Application of a Multiple-Group Higher-Order IRT Model.
Pub Date : 2026-01-07 | DOI: 10.1177/01466216251415011
Yale Quan, Chun Wang
Educational constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level, and it is often desirable to report scores from both levels simultaneously. However, measuring such complex constructs requires a very large item bank that no student could complete in a reasonable timeframe. Furthermore, most current score-reporting practices either report only subdomain scores or compute the general domain score post hoc. We propose that a multiple-group HO-IRT model with structural missingness can be used to report general and subdomain scores simultaneously while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a NEAT design with both a representative and a non-representative anchor test. While a representative anchor test is recommended in the literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of the multiple-group HO-IRT model in the presence of a non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates obtained under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors is comparable to the RMSE obtained under a representative anchor item set. Missing data were addressed using a full-information maximum likelihood approach to parameter estimation.
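The data structure at issue can be sketched as follows (a hedged illustration, not the paper's simulation design: the loadings, item counts, routing rule, and anchor length are invented for the example). Each subdomain ability is a loading times the general ability plus a disturbance, and routing examinees to forms creates structural missingness outside a shared anchor block.

```python
import numpy as np

rng = np.random.default_rng(11)

# Higher-order structure: subdomain ability = loading * general + noise.
n, n_dom, items_per_dom = 2000, 3, 10
lam = np.array([0.8, 0.7, 0.6])              # higher-order loadings
theta_g = rng.normal(size=n)
theta_d = (lam * theta_g[:, None]
           + np.sqrt(1 - lam**2) * rng.normal(size=(n, n_dom)))

# 2PL responses for every domain's items.
a = rng.uniform(0.8, 2.0, (n_dom, items_per_dom))
b = rng.normal(0, 1, (n_dom, items_per_dom))
p = 1 / (1 + np.exp(-a * (theta_d[:, :, None] - b)))
y = rng.binomial(1, p).astype(float)

# Structural missingness: each examinee takes one full domain form
# plus a short anchor block (here: first 3 items of every domain).
form = rng.integers(0, n_dom, n)
anchor = 3
for d in range(n_dom):
    off_form = form != d
    y[off_form, d, anchor:] = np.nan         # not administered

print("share of cells observed:", np.mean(~np.isnan(y)))
```

Full-information maximum likelihood then estimates the model from exactly this kind of block-missing response matrix, which is why the representativeness of the anchor block matters.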
{"title":"Calibrating Multidimensional Assessments With Structural Missingness: An Application of a Multiple-Group Higher-Order IRT Model.","authors":"Yale Quan, Chun Wang","doi":"10.1177/01466216251415011","DOIUrl":"10.1177/01466216251415011","url":null,"abstract":"<p><p>Educational Constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level. It is often desirable to report scores from both levels simultaneously. However, to measure such complex constructs, a very large item bank that is hard for a student to complete in any reasonable timeframe is needed. Furthermore, most current score reporting practices either only report subdomain scores, or the general domain score is calculated post hoc. We propose that a multiple group HO-IRT model with structural missingness can be used to simultaneously report general and subdomain scores while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a NEAT design with both a representative and non-representative anchor test. While a representative anchor test is recommended in literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of multiple group HO-IRT in the presence of non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates retrieved under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors, is comparable to the RMSE of IRT estimates retrieved under a representative anchor item set. Missing data were addressed using a full-information maximum likelihood approach to parameter estimation.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251415011"},"PeriodicalIF":1.2,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12779540/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145953603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving Latent Trait Estimation in Multidimensional Forced Choice Measures: Latent Regression Multi-Unidimensional Pairwise Preference Model.
Pub Date : 2026-01-03 | DOI: 10.1177/01466216251415189
Sean Joo, Philseok Lee, Stephen Stark
The field of psychometrics has made remarkable progress in developing item response theory (IRT) models for analyzing multidimensional forced choice (MFC) measures. This study introduces an innovative method that enhances the latent trait estimation of the Multi-Unidimensional Pairwise Preference (MUPP) model by incorporating latent regression modeling. To validate the efficacy of the new method, we conducted a comprehensive simulation study. The results of the study provide compelling evidence that the proposed latent regression MUPP (LR-MUPP) model significantly improves the accuracy of the latent trait estimation. This study opens new avenues for future research and encourages further development and refinement of MFC IRT models and their applications.
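For orientation, the standard MUPP pairwise-preference probability and the generic form of a latent regression extension can be written as follows (assuming the usual MUPP formulation; the paper's exact parameterization may differ):

\[
P\{s \succ t\}(\theta_{d_s}, \theta_{d_t}) =
\frac{P_s(1)\,P_t(0)}{P_s(1)\,P_t(0) + P_s(0)\,P_t(1)},
\]

where \(P_s(1)\) is the probability, under a unidimensional unfolding model such as the GGUM, that statement s would be endorsed on its own. A latent regression extension then models the trait vector with observed covariates \(\mathbf{x}\),

\[
\boldsymbol{\theta} = \boldsymbol{\Gamma}\mathbf{x} + \boldsymbol{\zeta},
\qquad \boldsymbol{\zeta} \sim N(\mathbf{0}, \boldsymbol{\Sigma}),
\]

so covariate information sharpens the posterior for each examinee's traits, which is the mechanism behind the reported gains in estimation accuracy.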
{"title":"Improving Latent Trait Estimation in Multidimensional Forced Choice Measures: Latent Regression Multi-Unidimensional Pairwise Preference Model.","authors":"Sean Joo, Philseok Lee, Stephen Stark","doi":"10.1177/01466216251415189","DOIUrl":"10.1177/01466216251415189","url":null,"abstract":"<p><p>The field of psychometrics has made remarkable progress in developing item response theory (IRT) models for analyzing multidimensional forced choice (MFC) measures. This study introduces an innovative method that enhances the latent trait estimation of the Multi-Unidimensional Pairwise Preference (MUPP) model by incorporating latent regression modeling. To validate the efficacy of the new method, we conducted a comprehensive simulation study. The results of the study provide compelling evidence that the proposed latent regression MUPP (LR-MUPP) model significantly improves the accuracy of the latent trait estimation. This study opens new avenues for future research and encourages further development and refinement of MFC IRT models and their applications.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251415189"},"PeriodicalIF":1.2,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12764422/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145907257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can the Generalized Graded Unfolding Model Fit Dominance Responses?
Pub Date : 2025-12-07 | DOI: 10.1177/01466216251401214
Jianbin Fu, Xuan Tan, Patrick C Kyllonen
Theoretically, the generalized graded unfolding model (GGUM) is more flexible than the generalized partial credit model (GPCM), a dominance model. For item responses generated by the GPCM, GGUM estimation can produce item response curves that overlap with those of the GPCM over a range of latent trait scores covering almost all of the population, and the discrimination and category threshold estimates from the two models are approximately equal. To achieve this, it is necessary either to use an informative prior centered around an extreme location (e.g., 4 for a positive GPCM item) or to fix the extreme locations when estimating the GGUM on GPCM items. A simulation study and applications to two real datasets support these theoretical claims. Various practical implications are discussed, and suggestions for future research are provided.
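For reference, the GPCM, the dominance model whose responses are being fit here, specifies (in standard notation, with the v = 0 term of each sum defined as 0):

\[
P(X_i = k \mid \theta) =
\frac{\exp\!\left[\sum_{v=0}^{k} a_i(\theta - b_{iv})\right]}
     {\sum_{c=0}^{m_i} \exp\!\left[\sum_{v=0}^{c} a_i(\theta - b_{iv})\right]},
\qquad k = 0,\dots,m_i.
\]

The GGUM instead models category probabilities through the distance between \(\theta\) and an item location \(\delta_i\), so endorsement falls off on both sides of \(\delta_i\). Pushing \(\delta_i\) to an extreme value (e.g., 4) places the whole population on one side of the location, which is why the GGUM's curves can then mimic a monotone dominance item.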
{"title":"Can the Generalized Graded Unfolding Model Fit Dominance Responses?","authors":"Jianbin Fu, Xuan Tan, Patrick C Kyllonen","doi":"10.1177/01466216251401214","DOIUrl":"10.1177/01466216251401214","url":null,"abstract":"<p><p>Theoretically, the generalized graded unfolding model (GGUM) is more flexible than the generalized partial credit model (GPCM), a dominance model. For item responses generated by the GPCM, the GGUM estimations can generate overlapping item response curves with those from the GPCM over a range of latent trait scores covering almost all of the population. The discrimination and category threshold estimates from the two models are approximately equal. It is necessary to use an informative prior around an extreme location (e.g., 4 for a positive GPCM item) or fix the extreme locations in the GGUM estimation of GPCM items to achieve the desired estimation. The simulation study and the applications on two real datasets support the theoretical claims. Various practical implications are discussed, and suggestions for future research are provided.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251401214"},"PeriodicalIF":1.2,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12682685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145716376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Unreliability of Test-Retest Reliability.
Pub Date : 2025-11-26 | DOI: 10.1177/01466216251401213
Domenic Groh
The Test-Retest Coefficient (TRC) is a central metric of reliability in Classical Test Theory and modern psychological assessments. Originally developed by early 20th-century psychometricians, it relies on the assumptions of fixed (i.e., perfectly stable) true scores and independent error scores. However, these assumptions are rarely, if ever, tested, despite the fact that their violation can introduce significant biases. This article explores the foundations of these assumptions and examines the performance of the TRC under varying conditions, including different sample sizes, true score stability, and error score dependence. Using simulated data, results show that decreasing true score stability biases TRC estimates, leading to underestimations of reliability. Additionally, error score dependence can inflate TRC values, making unreliable measures appear reliable. More fundamentally, when these assumptions are violated, the TRC becomes underidentified, meaning that multiple, substantively different data-generating processes can yield the same coefficient, thus undermining its interpretability. These findings call into question the TRC's suitability for applied settings, especially when traits fluctuate over time or measurement conditions are uncontrolled. Alternative approaches are briefly discussed.
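The two assumption violations and their opposite effects are easy to reproduce (a self-contained sketch under an assumed linear true-score model; the parameter values are illustrative, not the article's simulation design):

```python
import numpy as np

def trc(n, rho_t, rho_e, var_e, rng):
    """Simulate two测 occasions and return the test-retest correlation.
    rho_t: true-score stability; rho_e: error-score dependence."""
    t1 = rng.normal(size=n)
    t2 = rho_t * t1 + np.sqrt(1 - rho_t**2) * rng.normal(size=n)
    e1 = rng.normal(scale=np.sqrt(var_e), size=n)
    e2 = rho_e * e1 + np.sqrt((1 - rho_e**2) * var_e) * rng.normal(size=n)
    return np.corrcoef(t1 + e1, t2 + e2)[0, 1]

rng = np.random.default_rng(5)
print("target reliability:", 1 / (1 + 0.25))                   # 0.8
print("ideal assumptions :", trc(10**6, 1.0, 0.0, 0.25, rng))  # ~0.80
print("unstable traits   :", trc(10**6, 0.8, 0.0, 0.25, rng))  # ~0.64
print("dependent errors  :", trc(10**6, 1.0, 0.5, 0.25, rng))  # ~0.90
print("both violations   :", trc(10**6, 0.8, 0.8, 0.25, rng))  # ~0.80
```

With var(T) = 1 and var(E) = 0.25 the occasion reliability is 0.8; instability drags the TRC below it, error dependence inflates it, and the last line shows both violations together reproducing 0.8 exactly, so a TRC of 0.8 alone cannot distinguish these data-generating processes. That is the underidentification the article describes.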
{"title":"On the Unreliability of Test-Retest Reliability.","authors":"Domenic Groh","doi":"10.1177/01466216251401213","DOIUrl":"10.1177/01466216251401213","url":null,"abstract":"<p><p>The Test-Retest Coefficient (TRC) is a central metric of reliability in Classical Test Theory and modern psychological assessments. Originally developed by early 20th-century psychometricians, it relies on the assumptions of fixed (i.e., perfectly stable) true scores and independent error scores. However, these assumptions are rarely, if ever, tested, despite the fact that their violation can introduce significant biases. This article explores the foundations of these assumptions and examines the performance of the TRC under varying conditions, including different sample sizes, true score stability, and error score dependence. Using simulated data, results show that decreasing true score stability biases TRC estimates, leading to underestimations of reliability. Additionally, error score dependence can inflate TRC values, making unreliable measures appear reliable. More fundamentally, when these assumptions are violated, the TRC becomes underidentified, meaning that multiple, substantively different data-generating processes can yield the same coefficient, thus undermining its interpretability. These findings call into question the TRC's suitability for applied settings, especially when traits fluctuate over time or measurement conditions are uncontrolled. Alternative approaches are briefly discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251401213"},"PeriodicalIF":1.2,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12657207/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145649801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anchor Detection Strategy in Moderated Non-Linear Factor Analysis for Differential Item Functioning (DIF).
Pub Date : 2025-11-24 | DOI: 10.1177/01466216251401206
Sooyong Lee, Suyoung Kim, Seung W Choi
Ensuring measurement invariance is crucial for fair psychological and educational assessments, particularly in detecting Differential Item Functioning (DIF). Moderated Non-linear Factor Analysis (MNLFA) provides a flexible framework for detecting DIF by modeling item parameters as functions of observed covariates. However, a significant challenge in MNLFA-based DIF detection is anchor item selection, as improperly chosen anchors can bias results. This study proposes a refined constrained-baseline anchor detection approach utilizing information criteria (IC) for model selection. The proposed three-step procedure sequentially identifies potential DIF items through the Bayesian Information Criterion (BIC) and Weighted Information Criterion (WIC), followed by DIF-free anchor items using the Akaike Information Criterion (AIC). The method's effectiveness in controlling Type I error rates while maintaining statistical power is evaluated through simulation studies and empirical data analysis. Comparisons with regularization approaches demonstrate the proposed method's accuracy and computational efficiency.
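The information-criterion machinery itself is simple; the sketch below shows the selection logic with the standard AIC and BIC formulas (the log-likelihood values are a simulated stand-in for actual MNLFA fits, the two extra DIF parameters per item are an assumption, and the paper's WIC step is omitted because its exact form is specific to the article):

```python
import numpy as np

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * np.log(n) - 2 * loglik

# Stand-in log-likelihoods: ll_base for the constrained baseline model,
# plus a per-item improvement when that item's DIF parameters are freed.
# In practice these numbers come from fitted MNLFA models.
rng = np.random.default_rng(2)
n, k_base, n_items = 1000, 20, 10
ll_base = -12_000.0
gain = rng.gamma(1.0, 1.5, n_items)   # small improvements: DIF-free items
gain[[2, 7]] += 25.0                  # two items with substantial DIF

flagged, anchors = [], []
for i in range(n_items):
    ll_dif, k_dif = ll_base + gain[i], k_base + 2
    if bic(ll_dif, k_dif, n) < bic(ll_base, k_base, n):
        flagged.append(i)             # conservative screen: likely DIF
    elif aic(ll_dif, k_dif) >= aic(ll_base, k_base):
        anchors.append(i)             # even the liberal AIC sees no DIF
print("flagged:", flagged)
print("anchors:", anchors)
```

Because BIC penalizes extra parameters more heavily than AIC, items can fall in between: not flagged as DIF, yet not clean enough to anchor on, which mirrors the rationale for using a stricter criterion to flag DIF and a more liberal one to certify anchors.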
{"title":"Anchor Detection Strategy in Moderated Non-Linear Factor Analysis for Differential Item Functioning (DIF).","authors":"Sooyong Lee, Suyoung Kim, Seung W Choi","doi":"10.1177/01466216251401206","DOIUrl":"https://doi.org/10.1177/01466216251401206","url":null,"abstract":"<p><p>Ensuring measurement invariance is crucial for fair psychological and educational assessments, particularly in detecting Differential Item Functioning (DIF). Moderated Non-linear Factor Analysis (MNLFA) provides a flexible framework for detecting DIF by modeling item parameters as functions of observed covariates. However, a significant challenge in MNLFA-based DIF detection is anchor item selection, as improperly chosen anchors can bias results. This study proposes a refined constrained-baseline anchor detection approach utilizing information criteria (IC) for model selection. The proposed three-step procedure sequentially identifies potential DIF items through the Bayesian Information Criterion (BIC) and Weighted Information Criterion (WIC), followed by DIF-free anchor items using the Akaike Information Criterion (AIC). The method's effectiveness in controlling Type I error rates while maintaining statistical power is evaluated through simulation studies and empirical data analysis. Comparisons with regularization approaches demonstrate the proposed method's accuracy and computational efficiency.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251401206"},"PeriodicalIF":1.2,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12643905/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145641338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distinguishing Between Models for Extreme and Midpoint Response Styles as Opposite Poles of a Single Dimension versus Two Separate Dimensions: A Simulation Study.
Pub Date : 2025-09-13 | DOI: 10.1177/01466216251379471
Martijn Schoenmakers, Maria Bolsinova, Jesper Tijmstra
Extreme and midpoint response styles have frequently been found to decrease the validity of Likert-type questionnaire results. Different approaches for modelling extreme and midpoint responding have been proposed in the literature, with some advocating for a unidimensional conceptualization of the response styles as opposite poles, and others modelling them as separate dimensions. How these response styles are modelled influences the estimation complexity, parameter estimates, and detection of and correction for response styles in IRT models. For these reasons, we examine if it is possible to empirically distinguish between extreme and midpoint responding as two separate dimensions versus two opposite sides of a single dimension. The various conceptualizations are modelled using the multidimensional nominal response model, with the AIC and BIC being used to distinguish between the competing models in a simulation study and an empirical example. Results indicate good performance of both information criteria given sufficient sample size, test length, and response style strength. The BIC outperformed the AIC in cases where no response styles were present, while the AIC outperformed the BIC in cases where multiple response style dimensions were present. Implications of the results for practice are discussed.
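For concreteness, the competing conceptualizations can be written in multidimensional nominal response model form with fixed response-style scoring weights for a 5-point item (an illustration in the spirit of this modelling tradition; the weight vectors and notation are assumptions, and the paper's parameterization may differ):

\[
P(Y = k \mid \theta_c, \boldsymbol{\theta}_{rs}) =
\frac{\exp\!\left(a_k \theta_c + \mathbf{s}_k^{\top}\boldsymbol{\theta}_{rs} + c_k\right)}
     {\sum_{m=1}^{5} \exp\!\left(a_m \theta_c + \mathbf{s}_m^{\top}\boldsymbol{\theta}_{rs} + c_m\right)}.
\]

In the two-dimensional model, \(\boldsymbol{\theta}_{rs} = (\theta_{ERS}, \theta_{MRS})\) with extreme-category weights (1, 0, 0, 0, 1) and midpoint weights (0, 0, 1, 0, 0); in the single bipolar model, one style dimension carries weights such as (1, 0, −1, 0, 1), so its high pole pushes responses toward the extremes and its low pole toward the midpoint. Distinguishing the conceptualizations then reduces to asking, via AIC or BIC, whether the second style dimension is worth its extra parameters.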
{"title":"Distinguishing Between Models for Extreme and Midpoint Response Styles as Opposite Poles of a Single Dimension versus Two Separate Dimensions: A Simulation Study.","authors":"Martijn Schoenmakers, Maria Bolsinova, Jesper Tijmstra","doi":"10.1177/01466216251379471","DOIUrl":"10.1177/01466216251379471","url":null,"abstract":"<p><p>Extreme and midpoint response styles have frequently been found to decrease the validity of Likert-type questionnaire results. Different approaches for modelling extreme and midpoint responding have been proposed in the literature, with some advocating for a unidimensional conceptualization of the response styles as opposite poles, and others modelling them as separate dimensions. How these response styles are modelled influences the estimation complexity, parameter estimates, and detection of and correction for response styles in IRT models. For these reasons, we examine if it is possible to empirically distinguish between extreme and midpoint responding as two separate dimensions versus two opposite sides of a single dimension. The various conceptualizations are modelled using the multidimensional nominal response model, with the AIC and BIC being used to distinguish between the competing models in a simulation study and an empirical example. Results indicate good performance of both information criteria given sufficient sample size, test length, and response style strength. The BIC outperformed the AIC in cases where no response styles were present, while the AIC outperformed the BIC in cases where multiple response style dimensions were present. Implications of the results for practice are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251379471"},"PeriodicalIF":1.2,"publicationDate":"2025-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12433433/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145070752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}