Pub Date: 2026-02-13 | DOI: 10.1177/01466216261425242
Cody Ding
Survey questionnaires are essential tools in psychological and educational research, as the data they gather directly influence research conclusions and policy decisions. A major challenge in ensuring data quality is identifying aberrant response patterns that can jeopardize research outcomes, as they may introduce errors into subsequent analyses, potentially resulting in flawed theoretical conclusions and misguided practical applications. This study presents a computational machine learning method that employs autoencoder neural networks to detect aberrant response patterns in survey data. We evaluated the effectiveness of autoencoder neural networks in identifying response anomalies through both simulated and real data. The results indicate that this approach can effectively detect anomalies in responses, providing researchers with more options for their analyses and subsequent conclusions. Ultimately, this enhances the trustworthiness of findings in psychological and educational research.
{"title":"Rise of the Machine: Detecting Aberrant Response Patterns in Survey Instruments Using Autoencoder.","authors":"Cody Ding","doi":"10.1177/01466216261425242","DOIUrl":"10.1177/01466216261425242","url":null,"abstract":"<p><p>Survey questionnaires are essential tools in psychological and educational research, as the data they gather directly influence research conclusions and policy decisions. A major challenge in ensuring data quality is identifying aberrant response patterns that can jeopardize research outcomes, as they may introduce errors into subsequent analyses, potentially resulting in flawed theoretical conclusions and misguided practical applications. This study presents a machine learning solution that employs autoencoder neural networks to detect aberrant response patterns in survey data as a computational method. We evaluated the effectiveness of autoencoder neural networks in identifying response anomalies through both simulated and real data. The results indicate that this approach can effectively detect anomalies in responses, providing researchers with more options for their analyses and subsequent conclusions. Ultimately, this enhances the trustworthiness of findings in psychological and educational research.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261425242"},"PeriodicalIF":1.2,"publicationDate":"2026-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12904810/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146203553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-12 | DOI: 10.1177/01466216261425440
Kyung Yong Kim, Seongeun Kim, Haeju Lee
Item response theory (IRT) observed and true score equating are often conducted assuming that the latent variable is normally distributed. Although this might be a reasonable assumption for many educational and psychological assessments, not all variables can be approximated by a normal distribution. Under the common-item nonequivalent groups design, the current study examined the impact of latent density misspecification on IRT observed and true score equating. Specifically, equating results from two separate-calibration estimates (based on the Stocking-Lord linking method with normal and uniform weights) and three concurrent-calibration estimates (obtained with different characterizations of the latent densities for the old and new groups) were compared using both simulated and real data sets. In general, the concurrent calibration method with the latent densities for the two groups estimated using the empirical histogram method provided equating results with the least amount of error for most of the study conditions.
{"title":"The Impact of Latent Density Misspecification on Item Response Theory Equating Methods.","authors":"Kyung Yong Kim, Seongeun Kim, Haeju Lee","doi":"10.1177/01466216261425440","DOIUrl":"10.1177/01466216261425440","url":null,"abstract":"<p><p>Item response theory (IRT) observed and true score equating are often conducted assuming that the latent variable is normally distributed. Although this might be a reasonable assumption for many educational and psychological assessments, not all variables can be approximated by a normal distribution. Under the common-item nonequivalent groups design, the current study examined the impact of latent density misspecification on IRT observed and true score equating. Specifically, equating results provided by two separate calibration estimates based on the Stocking-Lord linking method with normal and uniform weights and three concurrent calibration estimates obtained with different characterizations of the latent densities for the old and new groups were compared using both simulated and real data sets. In general, the concurrent calibration method with the latent densities for the two groups estimated using the empirical histogram method provided equating results with the least amount of error for most of the study conditions. 
Using normal weights with the Stocking-Lord method generally performed much better than using uniform weights; however, the overall performance of the Stocking-Lord method with normal weights was acceptable only if the latent densities for the two groups were normal distributions or close to normal distributions.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261425440"},"PeriodicalIF":1.2,"publicationDate":"2026-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12900660/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146203548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
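For context, the true-score equating step itself can be sketched briefly: once both forms are on a common scale (the linking step the paper studies is not shown here), each attainable number-correct true score on the new form is mapped to the old form by inverting the new form's test characteristic curve. The 2PL item parameters below are simulated placeholders, not values from the study.

```python
import numpy as np

def tcc(theta, a, b):
    """Test characteristic curve: expected number-correct under the 2PL."""
    return np.sum(1.0 / (1.0 + np.exp(-a * (theta - b))))

def theta_for_score(x, a, b, lo=-8.0, hi=8.0, iters=60):
    """Invert the monotone TCC by bisection: theta whose true score is x."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if tcc(mid, a, b) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
a_new, b_new = rng.uniform(0.8, 1.6, 15), rng.normal(0.2, 1.0, 15)  # new form
a_old, b_old = rng.uniform(0.8, 1.6, 15), rng.normal(0.0, 1.0, 15)  # old form

# Map each new-form true score in range onto the old form's score scale.
equated = {x: tcc(theta_for_score(x, a_new, b_new), a_old, b_old)
           for x in range(2, 14)}
```

The latent density enters when this mapping is aggregated over examinees (observed-score equating) or when the forms are linked, which is where the misspecification studied above matters.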
Pub Date: 2026-02-09 | DOI: 10.1177/01466216261422480
Rudolf Debelak, Charles C Driver
We present a fast, score-based test for detecting model misspecification in item response theory (IRT) models that remains valid when person parameters are treated as fixed effects, as may be used for very large data sets. The new approximation (i) eliminates the need to pre-specify ability groups or priors for person abilities, (ii) does not require explicit functional form assumptions, (iii) works with two estimators designed for very high item/person counts-constrained joint maximum likelihood (CJML) and joint maximum a posteriori (JMAP)-and (iv) requires only a single model fit, making DIF-screening faster and simpler than alternatives based on model comparisons. A spline-based residualization step further suppresses spurious Type I error when the ordering covariate is correlated with ability. Simulations with the two-parameter logistic model show nominal error rates and high power once examinees contribute around 15-20 responses; only extremely short tests (around 10 items) still pose challenges under strong impact. An application to 1,602 reading items and 57,684 students from the Mindsteps platform demonstrates scalability and practical value, flagging 13% of items for gender-related DIF and correlating highly with conventional approaches that explicitly model DIF.
{"title":"Score-Based Tests With Fixed Effects Person Parameters in Item Response Theory: Detecting Model Misspecification Including Differential Item Functioning.","authors":"Rudolf Debelak, Charles C Driver","doi":"10.1177/01466216261422480","DOIUrl":"10.1177/01466216261422480","url":null,"abstract":"<p><p>We present a fast, score-based test to detecting model misspecification in item response theory (IRT) models that remains valid when person parameters are treated as fixed effects, as may be used for very large data sets. The new approximation (i) eliminates the need to pre-specify ability groups or priors for person abilities, (ii) does not require explicit functional form assumptions, (iii) works with two estimators designed for very high item/person counts-constrained joint maximum likelihood (CJML) and joint maximum a posteriori (JMAP)-and (iv) requires only a single model fit, making DIF-screening faster and simpler than alternatives based on model comparisons. A spline-based residualization step further suppresses spurious Type I error when the ordering covariate is correlated with ability. Simulations with the two-parameter logistic model show nominal error rates and high power once examinees contribute around 15-20 responses; only extremely short tests (around 10 items) still pose challenges under strong impact. An application to 1,602 reading items and 57,684 students from the <i>Mindsteps</i> platform demonstrates scalability and practical value, flagging 13% of items for gender-related DIF and correlating highly with conventional approaches of explicitly modeling DIF. 
Together, these results position the proposed test as a robust, computation-light diagnostic for large-scale assessments when classical random-effects approaches are infeasible, ability group structure is unknown or complex, or the shape of DIF effects is unknown or complex.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261422480"},"PeriodicalIF":1.2,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12890607/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146182799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-06 | DOI: 10.1177/01466216261420758
Jonas Bjermo, Ellinor Fackle Fornius, Frank Miller
Large-scale achievement tests require item banks stocked with items for use in future tests. Before an item is included in the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of future achievement tests, it is important to perform this calibration well, and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suitable ability. Theoretical evidence shows advantages of using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results also hold in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify the kinds of items for which the method works well. This enables targeting specific item types for optimal calibration.
{"title":"Optimal Item Calibration in the Context of the Swedish Scholastic Aptitude Test.","authors":"Jonas Bjermo, Ellinor Fackle Fornius, Frank Miller","doi":"10.1177/01466216261420758","DOIUrl":"10.1177/01466216261420758","url":null,"abstract":"<p><p>Large-scale achievement tests require the existence of item banks with items for use in future tests. Before an item is included into the bank, its characteristics need to be estimated. The process of estimating the item characteristics is called item calibration. For the quality of the future achievement tests, it is important to perform this calibration well and it is desirable to estimate the item characteristics as efficiently as possible. Methods of optimal design have been developed to allocate pretest items to examinees with the most suited ability. Theoretical evidence shows advantages with using ability-dependent allocation of pretest items. However, it is not clear whether these theoretical results hold also in a real testing situation. In this paper, we investigate the performance of an optimal ability-dependent allocation in the context of the Swedish Scholastic Aptitude Test (SweSAT) and quantify the gain from using the optimal allocation. On average over all items, we see an improved precision of calibration. While this average improvement is moderate, we are able to identify for what kind of items the method works well. This enables targeting specific item types for optimal calibration. 
We also discuss possibilities for improvements of the method.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261420758"},"PeriodicalIF":1.2,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12880929/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146144195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
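A simplified illustration of why ability-dependent allocation helps: under the 2PL model, an item's Fisher information peaks for examinees whose ability is near the item's difficulty, so routing the pretest item to ability-matched examinees raises the total information available for calibrating it. The item parameters and pool sizes below are hypothetical, and the paper's actual optimal-design criteria are more involved than this single-item comparison.

```python
import numpy as np

def item_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

rng = np.random.default_rng(2)
thetas = rng.normal(size=10_000)   # examinee ability pool
a_item, b_item = 1.2, 0.8          # hypothetical pretest item
n_alloc = 1_000                    # examinees who will see the item

# Random allocation vs. ability-matched allocation (examinees closest to b,
# where the 2PL information function is maximal).
random_idx = rng.choice(thetas.size, n_alloc, replace=False)
matched_idx = np.argsort(np.abs(thetas - b_item))[:n_alloc]

info_random = item_info(thetas[random_idx], a_item, b_item).sum()
info_matched = item_info(thetas[matched_idx], a_item, b_item).sum()
ratio = info_matched / info_random  # > 1: matched allocation is more efficient
```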
Pub Date: 2026-02-03 | DOI: 10.1177/01466216261415631
Guangming Li
The Markov chain Monte Carlo (MCMC) method is increasingly used to estimate variance components in generalizability theory (GT). However, uninformative priors, an essential ingredient of MCMC estimation, have not been systematically explored, and GT studies differ in the uninformative priors they use. This study examined the effect of different uninformative priors on the estimation of variance components. Based on a p × i × r design, eight uninformative prior distributions were compared in a simulation study and an empirical study: σ² ~ inv-gamma(0.001, 0.001) [prior 1]; σ² ~ inv-gamma(1, 1) [prior 2]; σ² ~ uniform(0.001, 1000) [prior 3]; σ ~ uniform(0, 100) [prior 4]; log(σ²) ~ uniform(−10, 10) [prior 5]; 1/σ² ~ pareto(1, 0.001) [prior 6]; σ²/(σ² + τ²)² ~ uniform [prior 7]; and σ²/[2τ(σ + τ)]² ~ uniform(0, 1) [prior 8]. Three posterior point estimates (mean, median, and mode) were computed for complete data and for data with 10% missing/sparse responses. The results show that (1) prior 1 gave the best and most stable posterior point estimates under most conditions, whereas prior 6 was always the worst; (2) differences among the priors appeared mainly in the variance components σ_i² and σ_r², for which prior 6 showed pronounced extreme bias, reaching 281.09 and 167.59; (3) the posterior mean always produced the largest bias, while the posterior median performed best; (4) estimates differed more across the uninformative priors when a variance component had few levels; and (5) results for complete data and for 10% missing/sparse data were essentially the same, so a small amount of missing/sparse data had little effect. Running times for the eight priors ranged from 489.78 to 692.58 seconds, with little difference among them.
{"title":"Influence of Uninformative Prior Distributions for MCMC Method on Estimating Variance Components in Generalizability Theory.","authors":"Guangming Li","doi":"10.1177/01466216261415631","DOIUrl":"10.1177/01466216261415631","url":null,"abstract":"<p><p>The Markov chain Monte Carlo (MCMC) method is more and more widely used to estimate variance components in generalizability theory (GT). However, as an essential part of MCMC method, uninformative priors haven't been explored and different GT researches vary in the use of uninformative priors. This study focused on effect of the different uninformative priors on estimating variance components. Based on <i>p × i × r</i> design, eight uninformative prior distributions were chosen for simulation study and empirical study, including <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 1], <math> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>∼</mo> <mi>i</mi> <mi>n</mi> <mi>v</mi> <mo>-</mo> <mi>g</mi> <mi>a</mi> <mi>m</mi> <mi>m</mi> <mi>a</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 2], <math> <mrow> <msup><mrow><mo> </mo> <mi>σ</mi></mrow> <mn>2</mn></msup> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0.001</mn> <mo>,</mo> <mn>1000</mn></mrow> <mo>)</mo></mrow> </mrow> </math> <b>[</b>prior 3<b>]</b>, <math><mrow><mi>σ</mi> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>100</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 4], <math><mrow><mi>log</mi> <mo></mo> <mrow><mo>(</mo> <msup><mi>σ</mi> <mn>2</mn></msup> <mo>)</mo></mrow> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> 
<mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mo>-</mo> <mn>10</mn> <mo>,</mo> <mn>10</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 5], <math> <mrow><mfrac><mn>1</mn> <msup><mi>σ</mi> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>p</mi> <mi>a</mi> <mi>r</mi> <mi>e</mi> <mi>t</mi> <mi>o</mi> <mrow><mo>(</mo> <mrow><mn>1</mn> <mo>,</mo> <mn>0.001</mn></mrow> <mo>)</mo></mrow> <mo> </mo> <mrow><mo>[</mo> <mrow><mtext>prior</mtext> <mo> </mo> <mn>6</mn></mrow> <mo>]</mo></mrow> </mrow> </math> , <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mo>(</mo> <mrow><msup><mi>σ</mi> <mn>2</mn></msup> <mo>+</mo> <msup><mi>τ</mi> <mn>2</mn></msup> </mrow> <mo>)</mo></mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi></mrow> </math> [prior 7], and <math> <mrow> <mfrac><msup><mi>σ</mi> <mn>2</mn></msup> <msup><mrow><mn>2</mn> <mi>τ</mi> <mrow><mo>(</mo> <mrow><mi>σ</mi> <mo>+</mo> <mi>τ</mi></mrow> <mo>)</mo></mrow> </mrow> <mn>2</mn></msup> </mfrac> <mo>∼</mo> <mi>u</mi> <mi>n</mi> <mi>i</mi> <mi>f</mi> <mi>o</mi> <mi>r</mi> <mi>m</mi> <mrow><mo>(</mo> <mrow><mn>0</mn> <mo>,</mo> <mn>1</mn></mrow> <mo>)</mo></mrow> </mrow> </math> [prior 8","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261415631"},"PeriodicalIF":1.2,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12867738/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146126804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
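To make the role of the inverse-gamma prior concrete, here is a minimal Gibbs sampler for a simpler persons-by-items design (one facet rather than the study's p × i × r design) with σ² ~ inv-gamma(0.001, 0.001) on both variance components, i.e. prior 1 above. The data-generating values (σ²_persons = 1.0, σ²_residual = 0.5) and the chain length are assumptions for the demo; inverse-gamma draws are taken as reciprocals of gamma draws.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated persons-by-items (p x i) data with known variance components:
# persons sigma2_p = 1.0, residual sigma2_e = 0.5 (hypothetical values).
n_p, n_i = 100, 20
person = rng.normal(scale=1.0, size=n_p)
x = 5.0 + person[:, None] + rng.normal(scale=np.sqrt(0.5), size=(n_p, n_i))

# Gibbs sampler with sigma2 ~ inv-gamma(0.001, 0.001) on both components.
alpha0 = beta0 = 0.001
mu, s2p, s2e = x.mean(), 1.0, 1.0
keep = []
for it in range(3000):
    # Person effects: conjugate normal update given mu, s2p, s2e.
    prec = n_i / s2e + 1.0 / s2p
    mean = (x - mu).sum(axis=1) / s2e / prec
    a = mean + rng.normal(size=n_p) / np.sqrt(prec)
    # Grand mean with a flat prior.
    mu = (x - a[:, None]).mean() + rng.normal() * np.sqrt(s2e / x.size)
    # Variance components: conjugate inverse-gamma updates
    # (if Y ~ Gamma(shape, scale=1/rate), then 1/Y ~ InvGamma(shape, rate)).
    s2p = 1.0 / rng.gamma(alpha0 + n_p / 2, 1.0 / (beta0 + (a ** 2).sum() / 2))
    resid = x - mu - a[:, None]
    s2e = 1.0 / rng.gamma(alpha0 + x.size / 2, 1.0 / (beta0 + (resid ** 2).sum() / 2))
    if it >= 1000:                       # discard burn-in
        keep.append((s2p, s2e))

draws = np.array(keep)
post_median = np.median(draws, axis=0)   # roughly the simulated [1.0, 0.5]
```

Swapping in the other priors above changes only the conditional draws for s2p and s2e; that substitution is exactly what the study varies.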
Pub Date: 2026-01-28 | DOI: 10.1177/01466216261420305
Xiaozhu Jian, Buyun Dai, Yeqi Qing, YuanPing Deng
This study presents a novel extension of the weighted score logistic model (WSLM). The WSLM is an advancement of the traditional dichotomous logistic model that incorporates an additional weighted score parameter. This model is specifically designed to analyze non-continuous category scored polytomous items in educational and psychological testing contexts. Within the WSLM framework, the mean difficulty parameter reflects the overall item difficulty, while both discrimination and mean difficulty parameters are estimated using marginal maximum likelihood estimation. A Monte Carlo simulation study was conducted to evaluate the performance of the WSLM, which demonstrated low levels of bias and root mean square error (RMSE) of item parameters, indicative of accurate parameter recovery. Under most simulation conditions, the fit statistics Q1 and Q4 for polytomous items under the WSLM remained below their respective critical chi-square values, suggesting acceptable model-data fit. These results support the applicability and robustness of the WSLM in practical assessment settings involving complex scoring schemes.
{"title":"Estimating and Fitting the Non-continuous category scored Polytomous Items under the Weighted Score Logistic Model and its Simulation Study.","authors":"Xiaozhu Jian, Buyun Dai, Yeqi Qing, YuanPing Deng","doi":"10.1177/01466216261420305","DOIUrl":"https://doi.org/10.1177/01466216261420305","url":null,"abstract":"<p><p>This study presents a novel extension of the weighted score logistic model (WSLM). The WSLM is an advancement of the traditional dichotomous logistic model that incorporates an additional weighted score parameter. This model is specifically designed to analyze non-continuous category scored polytomous items in educational and psychological testing contexts. Within the WSLM framework, the mean difficulty parameter reflects the overall item difficulty, while both discrimination and mean difficulty parameters are estimated using marginal maximum likelihood estimation. A Monte Carlo simulation study was conducted to evaluate the performance of the WSLM, which demonstrated low levels of bias and root mean square error (RMSE) of item parameters, indicative of accurate parameter recovery. Under most simulation conditions, the fit statistics Q1 and Q4 for polytomous items under the WSLM remained below their respective critical chi-square values, suggesting acceptable model-data fit. 
These results support the applicability and robustness of the WSLM in practical assessment settings involving complex scoring schemes.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261420305"},"PeriodicalIF":1.2,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12854999/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146108014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-20 | DOI: 10.1177/01466216261416025
Jari Metsämuuronen
Cohen's d is the most commonly used estimator for quantifying the magnitude of the difference between the means of two subpopulations. When comparing multiple populations simultaneously, Cohen's f can be used for the same purpose. Starting from the relationship between d and f in the dichotomous setting, several general formulas are derived that generalize d to the polytomous setting. The traditional simplified estimator d = 2f is studied as a shortcut estimator. It is strongly recommended to use the general formulas instead of the simplified ones when assessing the magnitude of the effect size, especially when the discrepancy of the extreme proportions of cases in the subpopulations exceeds 0.40.
{"title":"Generalized Cohen's d for Multiple Means and Polytomous Settings.","authors":"Jari Metsämuuronen","doi":"10.1177/01466216261416025","DOIUrl":"10.1177/01466216261416025","url":null,"abstract":"<p><p>Cohen's <i>d</i> is the most commonly used estimator to quantify the magnitude of the difference between the means of two subpopulations. When comparing multiple populations simultaneously, Cohen's <i>f</i> can be used for the same purpose. Using their relationship in the dichotomous setting, several general formulas for <i>d</i> are derived that generalize <i>d</i> to the polytomous setting. The traditional simplified estimator <i>d</i> = 2<i>f</i> is studied as a shortcut estimator. It is strongly recommended to use the general formulas instead of the simplified ones when assessing the magnitude of the effect size, especially when the discrepancy of the extreme proportions of cases in the subpopulations exceeds 0.40.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216261416025"},"PeriodicalIF":1.2,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12819128/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146031375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-07 | DOI: 10.1177/01466216251415011
Yale Quan, Chun Wang
Educational constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level. It is often desirable to report scores from both levels simultaneously. However, measuring such complex constructs requires a very large item bank, more than a student could complete in any reasonable timeframe. Furthermore, most current score reporting practices either report only subdomain scores, or the general domain score is calculated post hoc. We propose that a multiple-group HO-IRT model with structural missingness can be used to simultaneously report general and subdomain scores while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a NEAT design with both a representative and a non-representative anchor test. While a representative anchor test is recommended in the literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of the multiple-group HO-IRT model in the presence of a non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates retrieved under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors is comparable to the RMSE of IRT estimates retrieved under a representative anchor item set.
{"title":"Calibrating Multidimensional Assessments With Structural Missingness: An Application of a Multiple-Group Higher-Order IRT Model.","authors":"Yale Quan, Chun Wang","doi":"10.1177/01466216251415011","DOIUrl":"10.1177/01466216251415011","url":null,"abstract":"<p><p>Educational Constructs are becoming increasingly complex and are often conceptualized at both a general level and a subdomain level. It is often desirable to report scores from both levels simultaneously. However, to measure such complex constructs, a very large item bank that is hard for a student to complete in any reasonable timeframe is needed. Furthermore, most current score reporting practices either only report subdomain scores, or the general domain score is calculated post hoc. We propose that a multiple group HO-IRT model with structural missingness can be used to simultaneously report general and subdomain scores while controlling assessment length. Although the model itself is not new, we consider a novel application scenario using a NEAT design with both a representative and non-representative anchor test. While a representative anchor test is recommended in literature, it is sometimes unrealistic in practice when the multidimensional construct shifts over time. Hence, exploring the parameter recovery of multiple group HO-IRT in the presence of non-representative anchor test is especially interesting and important. We show, through Monte Carlo simulation, that the RMSE of IRT estimates retrieved under a non-representative anchor item set with a moderate correlation between the higher- and lower-order factors, is comparable to the RMSE of IRT estimates retrieved under a representative anchor item set. 
Missing data were addressed using a full-information maximum likelihood approach to parameter estimation.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251415011"},"PeriodicalIF":1.2,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12779540/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145953603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-03DOI: 10.1177/01466216251415189
Sean Joo, Philseok Lee, Stephen Stark
The field of psychometrics has made remarkable progress in developing item response theory (IRT) models for analyzing multidimensional forced choice (MFC) measures. This study introduces an innovative method that enhances latent trait estimation for the Multi-Unidimensional Pairwise Preference (MUPP) model by incorporating latent regression modeling. To validate the efficacy of the new method, we conducted a comprehensive simulation study. The results provide compelling evidence that the proposed latent regression MUPP (LR-MUPP) model significantly improves the accuracy of latent trait estimation. This study opens new avenues for future research and encourages further development and refinement of MFC IRT models and their applications.
{"title":"Improving Latent Trait Estimation in Multidimensional Forced Choice Measures: Latent Regression Multi-Unidimensional Pairwise Preference Model.","authors":"Sean Joo, Philseok Lee, Stephen Stark","doi":"10.1177/01466216251415189","DOIUrl":"10.1177/01466216251415189","url":null,"abstract":"<p><p>The field of psychometrics has made remarkable progress in developing item response theory (IRT) models for analyzing multidimensional forced choice (MFC) measures. This study introduces an innovative method that enhances latent trait estimation for the Multi-Unidimensional Pairwise Preference (MUPP) model by incorporating latent regression modeling. To validate the efficacy of the new method, we conducted a comprehensive simulation study. The results provide compelling evidence that the proposed latent regression MUPP (LR-MUPP) model significantly improves the accuracy of latent trait estimation. This study opens new avenues for future research and encourages further development and refinement of MFC IRT models and their applications.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251415189"},"PeriodicalIF":1.2,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12764422/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145907257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
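The MUPP model scores a pairwise forced-choice item by composing two single-statement endorsement probabilities into a probability of preferring one statement over the other. The sketch below illustrates that composition; note that published MUPP variants typically use GGUM-type single-statement models, whereas a 2PL stand-in is used here purely to keep the example short, and the function names (`p_endorse`, `mupp_prefer`) are hypothetical.

```python
import math

def p_endorse(theta, a, b):
    """2PL endorsement probability for a single statement
    (a simplifying stand-in for the single-statement model in MUPP)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mupp_prefer(theta_s, theta_t, a_s, b_s, a_t, b_t):
    """Probability of preferring statement s over statement t:
    P(endorse s, reject t) normalized over the two decisive outcomes."""
    ps = p_endorse(theta_s, a_s, b_s)
    pt = p_endorse(theta_t, a_t, b_t)
    return ps * (1 - pt) / (ps * (1 - pt) + (1 - ps) * pt)
```

The latent regression extension in the abstract would replace the standard-normal prior on each trait with a prior whose mean is a regression on person covariates, so trait estimates borrow strength from auxiliary information.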
Pub Date : 2025-12-07DOI: 10.1177/01466216251401214
Jianbin Fu, Xuan Tan, Patrick C Kyllonen
Theoretically, the generalized graded unfolding model (GGUM) is more flexible than the generalized partial credit model (GPCM), a dominance model. For item responses generated by the GPCM, GGUM estimation can produce item response curves that overlap with those from the GPCM over a range of latent trait scores covering almost all of the population. The discrimination and category threshold estimates from the two models are approximately equal. It is necessary to use an informative prior around an extreme location (e.g., 4 for a positive GPCM item) or to fix the extreme locations when estimating the GGUM on GPCM items to achieve the desired estimation. A simulation study and applications to two real datasets support the theoretical claims. Various practical implications are discussed, and suggestions for future research are provided.
{"title":"Can the Generalized Graded Unfolding Model Fit Dominance Responses?","authors":"Jianbin Fu, Xuan Tan, Patrick C Kyllonen","doi":"10.1177/01466216251401214","DOIUrl":"10.1177/01466216251401214","url":null,"abstract":"<p><p>Theoretically, the generalized graded unfolding model (GGUM) is more flexible than the generalized partial credit model (GPCM), a dominance model. For item responses generated by the GPCM, GGUM estimation can produce item response curves that overlap with those from the GPCM over a range of latent trait scores covering almost all of the population. The discrimination and category threshold estimates from the two models are approximately equal. It is necessary to use an informative prior around an extreme location (e.g., 4 for a positive GPCM item) or to fix the extreme locations when estimating the GGUM on GPCM items to achieve the desired estimation. A simulation study and applications to two real datasets support the theoretical claims. Various practical implications are discussed, and suggestions for future research are provided.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251401214"},"PeriodicalIF":1.2,"publicationDate":"2025-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12682685/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145716376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
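The dominance model in the abstract above, the GPCM, gives each response category a probability built from cumulative step logits. A minimal sketch of that category response function follows; the function name and the parameter values are illustrative, not drawn from the paper.

```python
import math

def gpcm_probs(theta, a, thresholds):
    """Category response probabilities under the generalized partial
    credit model: P(X = k) is proportional to
    exp(sum_{j<=k} a * (theta - b_j)), with the empty sum (k = 0) = 0."""
    logits = [0.0]
    s = 0.0
    for b in thresholds:
        s += a * (theta - b)
        logits.append(s)
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the GPCM is a dominance model, the probability of the highest category increases monotonically in theta, whereas GGUM curves can fold back; the overlap the abstract describes occurs when the GGUM's fold is pushed to an extreme location outside the population range.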