Using Deep Learning to Choose Optimal Smoothing Values for Equating
Pub Date: 2025-08-23 | DOI: 10.1177/01466216251363244
Chunyan Liu, Zhongmin Cui
Test developers typically use alternate test forms to protect the integrity of test scores. Because test forms may differ in difficulty, scores on different test forms are adjusted through a psychometric procedure called equating. When conducting equating, psychometricians often apply smoothing methods to reduce the random equating error that results from sampling. When using the cubic spline postsmoothing method, they compare plots produced with different degrees of smoothing and choose the optimal value. This manual process, however, could be automated with the help of deep learning, a machine learning technique commonly used for image classification. In this study, a convolutional neural network was trained on human-classified postsmoothing plots. The trained network was then used to choose optimal smoothing values for empirical testing data, and its choices were compared to human choices. The agreement rate between humans and the trained network was as high as 71%, suggesting the potential of deep learning for choosing optimal smoothing values in equating.
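The abstract does not specify the network architecture, so the following is only a minimal sketch of the general idea in PyTorch: a small convolutional classifier that maps a rendered postsmoothing plot to one of several candidate smoothing values. The image size, layer sizes, number of classes, and dummy inputs are all hypothetical, not taken from the study.

```python
import torch
import torch.nn as nn

class SmoothingPlotCNN(nn.Module):
    """Toy CNN that maps a rendered postsmoothing plot (128x128 grayscale)
    to one of n_classes candidate smoothing values (labels are hypothetical)."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, 128, 128) plot images
        return self.classifier(self.features(x))

model = SmoothingPlotCNN(n_classes=10)
dummy_plots = torch.randn(4, 1, 128, 128)   # stand-in for rendered equating plots
logits = model(dummy_plots)                  # (4, 10) scores over smoothing values
predicted_s = logits.argmax(dim=1)           # index of the chosen smoothing value
```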
{"title":"Using Deep Learning to Choose Optimal Smoothing Values for Equating.","authors":"Chunyan Liu, Zhongmin Cui","doi":"10.1177/01466216251363244","DOIUrl":"https://doi.org/10.1177/01466216251363244","url":null,"abstract":"<p><p>Test developers typically use alternate test forms to protect the integrity of test scores. Because test forms may differ in difficulty, scores on different test forms are adjusted through a psychometrical procedure called equating. When conducting equating, psychometricians often apply smoothing methods to reduce random error of equating resulting from sampling. During the process, they compare plots of different smoothing degrees and choose the optimal value when using the cubic spline postsmoothing method. This manual process, however, could be automated with the help of deep learning-a machine learning technique commonly used for image classification. In this study, a convolutional neural network was trained using human-classified postsmoothing plots. The trained network was used to choose optimal smoothing values with empirical testing data, which were compared to human choices. The agreement rate between humans and the trained network was as large as 71%, suggesting the potential use of deep learning for choosing optimal smoothing values for equating.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251363244"},"PeriodicalIF":1.2,"publicationDate":"2025-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12374957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144974566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining Propensity Scores and Common Items for Test Score Equating
Pub Date: 2025-07-30 | DOI: 10.1177/01466216251363240
Inga Laukaityte, Gabriel Wallin, Marie Wiberg
Ensuring that test scores are fair and comparable across different test forms and different test groups is a significant statistical challenge in educational testing. Methods to achieve score comparability, a process known as test score equating, often rely on including common test items or on assuming that test-taker groups are similar in key characteristics. This study explores a novel approach that combines propensity scores, based on test takers' background covariates, with information from common items using kernel smoothing techniques for binary-scored test items. An empirical analysis using data from a high-stakes college admissions test evaluates the standard errors and differences in adjusted test scores. A simulation study examines the impact of factors such as the number of test takers, the number of common items, and the correlation between covariates and test scores on the method's performance. The findings demonstrate that integrating propensity scores with common-item information reduces standard errors and bias more effectively than using either source alone. This suggests that balancing the groups on the test takers' covariates enhances the fairness and accuracy of test score comparisons across different groups. The proposed method highlights the benefits of using all the collected data to improve score comparability.
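As a rough illustration of the ingredients, not the authors' estimator, the sketch below estimates propensity scores from background covariates with logistic regression, forms inverse-propensity weights, and applies Gaussian kernel smoothing to the weighted score distribution. The simulated data, the bandwidth h, and the simplified continuization (which omits the mean- and variance-preserving rescaling used in standard kernel equating) are all assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 500 test takers, 2 background covariates, a 0/1 group
# label, and an integer sum score on a 40-item binary-scored test.
X_cov = rng.normal(size=(500, 2))
group = rng.integers(0, 2, size=500)
scores = rng.binomial(40, 0.6, size=500)

# Step 1: propensity scores from background covariates.
ps = LogisticRegression().fit(X_cov, group).predict_proba(X_cov)[:, 1]

# Step 2: inverse-propensity weights to balance the two groups.
w = np.where(group == 1, 1.0 / ps, 1.0 / (1.0 - ps))

# Step 3: Gaussian-kernel-smoothed CDF of the weighted score distribution.
def smoothed_cdf(x, scores, weights, h=0.6):
    weights = weights / weights.sum()
    return np.sum(weights * norm.cdf((x - scores) / h))

print(smoothed_cdf(25.0, scores, w))  # P(smoothed score <= 25) in the reweighted sample
```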
{"title":"Combining Propensity Scores and Common Items for Test Score Equating.","authors":"Inga Laukaityte, Gabriel Wallin, Marie Wiberg","doi":"10.1177/01466216251363240","DOIUrl":"10.1177/01466216251363240","url":null,"abstract":"<p><p>Ensuring that test scores are fair and comparable across different test forms and different test groups is a significant statistical challenge in educational testing. Methods to achieve score comparability, a process known as test score equating, often rely on including common test items or assuming that test taker groups are similar in key characteristics. This study explores a novel approach that combines propensity scores, based on test takers' background covariates, with information from common items using kernel smoothing techniques for binary-scored test items. An empirical analysis using data from a high-stakes college admissions test evaluates the standard errors and differences in adjusted test scores. A simulation study examines the impact of factors such as the number of test takers, the number of common items, and the correlation between covariates and test scores on the method's performance. The findings demonstrate that integrating propensity scores with common item information reduces standard errors and bias more effectively than using either source alone. This suggests that balancing the groups on the test-takers' covariates enhance the fairness and accuracy of test score comparisons across different groups. The proposed method highlights the benefits of considering all the collected data to improve score comparability.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251363240"},"PeriodicalIF":1.2,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12310624/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144776645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MCorrSeqPerm: Searching for the Maximum Statistically Significant System of Linear Correlations and its Application in Work Psychology
Pub Date: 2025-07-21 | DOI: 10.1177/01466216251360562
Katarzyna Stapor, Grzegorz Kończak, Damian Grabowski, Marta Żywiołek-Szeja, Agata Chudzicka-Czupała
The paper addresses the problem of detecting a statistically significant subset of the relationships under consideration. The Pearson linear correlation coefficient calculated from a sample was used to quantify the strength of each relationship. Testing the significance of many relationships simultaneously raises the issue of multiple hypothesis testing: without proper error control, the probability of making a Type I error is, in practice, much higher than the assumed significance level. The paper proposes an alternative approach, a new stepwise procedure (MCorrSeqPerm) that finds the maximum statistically significant system of linear correlations while keeping the error rate at the assumed level. The proposed procedure relies on a sequence of permutation tests. Its application to the analysis of relationships between stress experienced at work and job satisfaction was compared with Holm's classic method in terms of the number of significant correlations detected.
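The stepwise MCorrSeqPerm procedure itself is not reproduced here; the sketch below only illustrates its building block, a permutation test for a single Pearson correlation. The simulated data, the number of permutations, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def perm_test_corr(x, y, n_perm=5000, rng=rng):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (n_perm + 1)

# Hypothetical example: work-stress and job-satisfaction scores for 60 respondents.
stress = rng.normal(size=60)
satisfaction = -0.4 * stress + rng.normal(scale=0.9, size=60)
r, p = perm_test_corr(stress, satisfaction)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```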
{"title":"MCorrSeqPerm: Searching for the Maximum Statistically Significant System of Linear Correlations and its Application in Work Psychology.","authors":"Katarzyna Stapor, Grzegorz Kończak, Damian Grabowski, Marta Żywiołek-Szeja, Agata Chudzicka-Czupała","doi":"10.1177/01466216251360562","DOIUrl":"10.1177/01466216251360562","url":null,"abstract":"<p><p>The paper addresses the problem of detecting a statistically significant subset of input considered relationships. The Pearson linear correlation coefficient calculated from a sample was used to determine the strength of a relationship. Simultaneous testing of the significance of many relationships is related to the issue of multiple hypothesis testing. In such a scenario, the probability of making a type I error without proper error control is, in practice, much higher than the assumed level of significance. The paper proposes an alternative approach: a new stepwise procedure (MCorrSeqPerm) allowing for finding the maximum statistically significant system of linear correlations keeping the error at the assumed level. The proposed procedure relies on a sequence of permutation tests. Its application in the analysis of relationships in the problem of examining stress experienced at work and job satisfaction was compared with Holm's classic method in detecting the number of significant correlations.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251360562"},"PeriodicalIF":1.0,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12279768/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144700124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multidimensional Continuous Response Model for Measuring Unipolar Traits
Pub Date: 2025-07-15 | DOI: 10.1177/01466216251360311
Pere J Ferrando, Fabia Morales-Vives, José M Casas, David Navarro-González
Unipolar constructs are encountered in a variety of non-cognitive measurement scenarios, including clinical and forensic assessments, symptom checklists, addictive behaviors, and irrational beliefs, among others. Item Response Theory (IRT) models intended for fitting and scoring measures of unipolar constructs, particularly log-logistic models, are fully developed at present, but they are limited to unidimensional structures. This paper proposes a novel multidimensional log-logistic IRT model for double-bounded continuous response items that measure unipolar constructs. The chosen response format is a natural fit for, and is increasingly used in, the scenarios for which the model is intended. The proposed model is remarkably simple, has interesting properties and, at the structural level, can be fitted using linearizing transformations. Multidimensional item location and discrimination indices are developed, and procedures for fitting the model, scoring the respondents, and assessing conditional and marginal accuracy (including information curves) are proposed. Everything that is proposed has been implemented in a fully available R program. The functioning of the model is illustrated with an empirical example based on data from 371 undergraduate students who answered the Depression and Anxiety subscales of the Brief Symptom Inventory 18 and the Rosenberg Self-Esteem Scale. The results show the usefulness of the new model for adequately interpreting unipolar variables, particularly in terms of the conditional reliability of trait estimates and external validity.
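The paper's multidimensional continuous-response model is not reproduced here. As background only, the sketch below shows one common unidimensional log-logistic item response function for a binary item and a unipolar trait theta >= 0, the kind of model the abstract describes as currently limited to unidimensional structures. The functional form and parameter values are assumptions for illustration.

```python
import numpy as np

def loglogistic_irf(theta, a, b):
    """Unidimensional log-logistic IRF for a unipolar trait theta >= 0:
    P(X = 1 | theta) = a * theta**b / (1 + a * theta**b)."""
    t = a * np.power(theta, b)
    return t / (1.0 + t)

theta = np.linspace(0.0, 5.0, 6)          # hypothetical trait values
print(loglogistic_irf(theta, a=1.5, b=2.0))
```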
{"title":"A Multidimensional Continuous Response Model for Measuring Unipolar Traits.","authors":"Pere J Ferrando, Fabia Morales-Vives, José M Casas, David Navarro-González","doi":"10.1177/01466216251360311","DOIUrl":"10.1177/01466216251360311","url":null,"abstract":"<p><p>Unipolar constructs are encountered in a variety of non-cognitive measurement scenarios that include clinical and forensic assessments, symptoms checklists, addictive behaviors, and irrational beliefs among others. Furthermore, Item Response Theory (IRT) models intended for fitting and scoring measures of unipolar constructs, particularly Log-Logistic models, are fully developed at present, but they are limited to unidimensional structures. This paper proposes a novel multidimensional log-logistic IRT model intended for double-bounded continuous response items that measure unipolar constructs. The chosen response format is a natural application, and is increasingly used, in the scenarios for which the model is intended. The proposed model is remarkably simple, has interesting properties and, at the structural level can be fitted by using linearizing transformations. Multidimensional item location and discrimination indices are developed, and procedures for fitting the model, scoring the respondents, and assessing conditional and marginal accuracy (including information curves) are proposed. Everything that is proposed has been implemented in fully available R program. The functioning of the model is illustrated by using an empirical example with the data of 371 undergraduate students who answered the Depression and Anxiety subscales of the <i>Brief Symptom Inventory 18</i> and also the <i>Rosenberg Self-Esteem Scale.</i> The results show the usefulness of the new model to adequately interpret unipolar variables, particularly in terms of the conditional reliability of trait estimates and external validity.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251360311"},"PeriodicalIF":1.0,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12267208/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144676191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Study of Latent State-Trait Theory Framework in Piecewise Growth Models
Pub Date: 2025-07-15 | DOI: 10.1177/01466216251360565
Ihnwhi Heo, Ren Liu, Haiyan Liu, Sarah Depaoli, Fan Jia
Latent state-trait (LST) theory provides a psychometric framework that facilitates the measurement of long-term trait change and short-term state variability in longitudinal data. While LST theory has guided the development and extension of linear latent growth models within its theoretical framework, the integration of piecewise growth models (PGMs) into the LST theory framework remains uninvestigated. PGMs are well suited for modeling nonlinear developmental processes comprised of distinct stages, which frequently arise in psychological and educational research. Their ability to capture phase-specific changes makes them a useful tool for applied and methodological researchers. This paper introduces a novel measurement approach that integrates PGMs into the framework of LST theory by presenting single-indicator piecewise growth models (SI-PGMs) and multiple-indicator piecewise growth models (MI-PGMs). We detail the model specifications for both SI-PGMs and MI-PGMs. For SI-PGMs, we define the reliability coefficient; for MI-PGMs, we define the consistency coefficient, occasion specificity coefficient, and reliability coefficient. We then conduct simulations to evaluate the models' performance in accurately recovering growth parameters and capturing true reliability. The simulation results indicated that SI-PGMs and MI-PGMs successfully recovered growth parameters and performed comparably in the absence of situational influences. However, MI-PGMs outperformed SI-PGMs when situational influences were present. We conclude by outlining directions for future research and providing Mplus syntax to support the dissemination of the models.
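The Mplus specifications referenced in the abstract are not shown here; the sketch below only illustrates the core idea of piecewise growth, coding two linear slope factors around a knot and simulating one trajectory. The number of occasions, knot location, and parameter values are hypothetical.

```python
import numpy as np

# Hypothetical design: 7 equally spaced occasions with a phase change (knot) at t = 3.
occasions = np.arange(7)
knot = 3

# Loadings for the two slope factors in a two-piece linear growth model:
# slope 1 grows until the knot and then stays flat; slope 2 starts at the knot.
slope1 = np.minimum(occasions, knot)
slope2 = np.maximum(occasions - knot, 0)
print(slope1)  # [0 1 2 3 3 3 3]
print(slope2)  # [0 0 0 0 1 2 3]

# Simulated trajectory for one person: intercept 10, pre-knot slope 1.2,
# post-knot slope -0.5, plus occasion-specific (state-like) residuals.
rng = np.random.default_rng(2)
y = 10 + 1.2 * slope1 - 0.5 * slope2 + rng.normal(scale=0.8, size=7)
print(np.round(y, 2))
```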
{"title":"A Study of Latent State-Trait Theory Framework in Piecewise Growth Models.","authors":"Ihnwhi Heo, Ren Liu, Haiyan Liu, Sarah Depaoli, Fan Jia","doi":"10.1177/01466216251360565","DOIUrl":"10.1177/01466216251360565","url":null,"abstract":"<p><p>Latent state-trait (LST) theory provides a psychometric framework that facilitates the measurement of long-term trait change and short-term state variability in longitudinal data. While LST theory has guided the development and extension of linear latent growth models within its theoretical framework, the integration of piecewise growth models (PGMs) into the LST theory framework remains uninvestigated. PGMs are well suited for modeling nonlinear developmental processes comprised of distinct stages, which frequently arise in psychological and educational research. Their ability to capture phase-specific changes makes them a useful tool for applied and methodological researchers. This paper introduces a novel measurement approach that integrates PGMs into the framework of LST theory by presenting single-indicator piecewise growth models (SI-PGMs) and multiple-indicator piecewise growth models (MI-PGMs). We detail the model specifications for both SI-PGMs and MI-PGMs. For SI-PGMs, we define the reliability coefficient; for MI-PGMs, we define the consistency coefficient, occasion specificity coefficient, and reliability coefficient. We then conduct simulations to evaluate the models' performance in accurately recovering growth parameters and capturing true reliability. The simulation results indicated that SI-PGMs and MI-PGMs successfully recovered growth parameters and performed comparably in the absence of situational influences. However, MI-PGMs outperformed SI-PGMs when situational influences were present. We conclude by outlining directions for future research and providing M<i>plus</i> syntax to support the dissemination of the models.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251360565"},"PeriodicalIF":1.0,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12264255/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144660748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structure-Based Classification Approach
Pub Date: 2025-07-14 | DOI: 10.1177/01466216251360544
Jongwan Kim
This study introduces a novel structure-based classification (SBC) framework that leverages pairwise distance representations of rating data to enhance classification performance while mitigating individual differences in scale usage. Unlike conventional feature-based approaches that rely on absolute rating scores, SBC transforms rating data into structured representations by computing pairwise distances between rating dimensions. This transformation captures the relational structure of ratings, ensuring consistency between training and test datasets and enhancing model robustness. To evaluate the effectiveness of this approach, we conducted a simulation study in which participants rated stimuli across multiple affective dimensions, with systematic individual differences in scale usage. The results demonstrated that SBC successfully classified affective stimuli despite these variations, performing comparably to traditional classification methods. The findings suggest that relational structures among rating dimensions contain meaningful information for affective classification, akin to functional connectivity approaches in cognitive neuroscience. By focusing on rating interdependencies as well as absolute values, SBC provides a robust and generalizable method for analyzing subjective responses, with implications for psychological research.
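A minimal sketch of the structure-based idea, assuming ratings on several dimensions per trial: each trial is re-represented by the pairwise absolute differences between its rating dimensions, which cancels a rater-specific additive offset, and a standard classifier is fit to that representation. The simulated data and the choice of classifier are assumptions, not the study's design.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def to_structure(ratings):
    """One trial's ratings on k dimensions -> k*(k-1)/2 pairwise absolute
    differences between dimensions (a simple distance-based representation)."""
    return pdist(ratings.reshape(-1, 1), metric="cityblock")

# Hypothetical data: 200 trials rated on 5 affective dimensions, 2 stimulus classes,
# with a shared per-trial rater offset that mimics individual scale-usage differences.
ratings = rng.normal(size=(200, 5)) + rng.normal(size=(200, 1))
labels = rng.integers(0, 2, size=200)
ratings[labels == 1, 0] += 1.0   # class difference on one dimension

# Structure-based input: 10 pairwise distances per trial (offset cancels out).
X_struct = np.array([to_structure(r) for r in ratings])
clf = LogisticRegression(max_iter=1000).fit(X_struct, labels)
print(clf.score(X_struct, labels))
```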
{"title":"Structure-Based Classification Approach.","authors":"Jongwan Kim","doi":"10.1177/01466216251360544","DOIUrl":"10.1177/01466216251360544","url":null,"abstract":"<p><p>This study introduces a novel structure-based classification (SBC) framework that leverages pairwise distance representations of rating data to enhance classification performance while mitigating individual differences in scale usage. Unlike conventional feature-based approaches that rely on absolute rating scores, SBC transforms rating data into structured representations by computing pairwise distances between rating dimensions. This transformation captures the relational structure of ratings, ensuring consistency between training and test datasets and enhancing model robustness. To evaluate the effectiveness of this approach, we conducted a simulation study in which participants rated stimuli across multiple affective dimensions, with systematic individual differences in scale usage. The results demonstrated that SBC successfully classified affective stimuli despite these variations, performing comparably to traditional classification methods. The findings suggest that relational structures among rating dimensions contain meaningful information for affective classification, akin to functional connectivity approaches in cognitive neuroscience. By focusing on rating interdependencies as well as absolute values, SBC provides a robust and generalizable method for analyzing subjective responses, with implications for psychological research.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251360544"},"PeriodicalIF":1.0,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12264251/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144660749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Including Empirical Prior Information in the Reliable Change Index
Pub Date: 2025-07-10 | DOI: 10.1177/01466216251358492
R Philip Chalmers, Sarah Campbell
The reliable change index (RCI; Jacobson & Truax, 1991) is commonly used to assess whether individuals have changed across two measurement occasions, and it has seen many augmentations and improvements since its initial conception. In this study, we extend an item response theory version of the RCI presented by Jabrayilov et al. (2016) by including empirical priors in the associated RCI computations whenever group-level differences are quantifiable from post-test response information. Based on a reanalysis and extension of a previous simulation study, we demonstrate that although a small amount of bias is added to the estimates of the latent trait differences when no true change is present, including empirical prior information generally improves the Type I error behavior of the model-based RCI. Consequently, when non-zero changes in the latent trait are present, the bias and sampling variability are shown to be more favorable than those of competing estimators, leading to an increase in power to detect non-zero changes.
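The IRT-based extension with empirical priors is not reproduced here; for orientation, the sketch below implements only the classic Jacobson and Truax (1991) RCI that serves as the starting point. The numerical values in the example are hypothetical.

```python
import math

def rci_jacobson_truax(pre, post, sd_pre, reliability):
    """Classic Jacobson & Truax (1991) reliable change index:
    RCI = (post - pre) / SE_diff, with SE_diff = sqrt(2) * SEM and
    SEM = sd_pre * sqrt(1 - reliability)."""
    sem = sd_pre * math.sqrt(1.0 - reliability)
    se_diff = math.sqrt(2.0) * sem
    return (post - pre) / se_diff

# Hypothetical example: pretest 30, posttest 22, SD 7, reliability .85.
rci = rci_jacobson_truax(pre=30, post=22, sd_pre=7.0, reliability=0.85)
print(round(rci, 2), "reliable change" if abs(rci) > 1.96 else "no reliable change")
```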
{"title":"Including Empirical Prior Information in the Reliable Change Index.","authors":"R Philip Chalmers, Sarah Campbell","doi":"10.1177/01466216251358492","DOIUrl":"10.1177/01466216251358492","url":null,"abstract":"<p><p>The reliable change index (RCI; Jacobson & Truax, 1991) is commonly used to assess whether individuals have changed across two measurement occasions, and has seen many augmentations and improvements since its initial conception. In this study, we extend an item response theory version of the RCI presented by Jabrayilov et al. (2016) by including empirical priors in the associated RCI computations whenever group-level differences are quantifiable given post-test response information. Based on a reanalysis and extension of a previous simulation study, we demonstrate that although a small amount of bias is added to the estimates of the latent trait differences when no true change is present, including empirical prior information will generally improve the Type I behavior of the model-based RCI. Consequently, when non-zero changes in the latent trait are present the bias and sampling variability are show to be more favorable than competing estimators, subsequently leading to an increase in power to detect non-zero changes.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251358492"},"PeriodicalIF":1.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12245826/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144627476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using Group Differences in True Score Relationships to Evaluate Measurement Bias
Pub Date: 2025-07-07 | DOI: 10.1177/01466216251358491
Michael T Kane, Joanne Kane
This paper makes three contributions to our understanding of measurement bias and predictive bias in testing. First, we develop a linear model for assessing measurement bias across two tests and two groups in terms of the estimated true-score relationships between the two tests in the two groups. This new model for measurement bias is structurally similar to the Cleary model for predictive bias, but it relies on the Errors-in-Variables (EIV) regression model, rather than the Ordinary-Least-Squares (OLS) regression model. Second, we examine some differences between measurement bias and predictive bias in three cases in which two groups have different true-score means, and we illustrate how regression toward the mean in OLS regression can lead to questionable conclusions about test bias if the differences between measurement bias and predictive bias are ignored. Third, we reevaluate a body of empirical findings suggesting that the tests employed in college-admissions and employment-testing programs tend to over-predict criterion performance for minorities, and we show that these findings are consistent with the occurrence of substantial measurement bias against the minority group relative to the majority group.
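A minimal numerical sketch of why the OLS/EIV distinction matters: with measurement error in the predictor, the OLS slope between observed scores is attenuated relative to the true-score slope, and a simple errors-in-variables correction (dividing by the predictor's reliability) recovers it. The simulated reliabilities and slope are assumptions, and this is only the attenuation-correction special case, not the authors' full model for two groups.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: true scores T, observed scores on tests X and Y with error;
# the true-score relationship is Y_true = 0.8 * T + 5.
n = 10_000
T = rng.normal(50, 10, size=n)
x_obs = T + rng.normal(0, 5, size=n)            # reliability of X = 100 / (100 + 25) = 0.8
y_obs = 0.8 * T + 5 + rng.normal(0, 5, size=n)

# OLS slope of Y on observed X is attenuated toward zero by error in X.
c = np.cov(x_obs, y_obs)
ols_slope = c[0, 1] / c[0, 0]

# A simple errors-in-variables correction divides by the reliability of X,
# approximately recovering the true-score slope of 0.8.
rel_x = 100 / (100 + 25)
eiv_slope = ols_slope / rel_x
print(round(ols_slope, 3), round(eiv_slope, 3))
```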
{"title":"Using Group Differences in True Score Relationships to Evaluate Measurement Bias.","authors":"Michael T Kane, Joanne Kane","doi":"10.1177/01466216251358491","DOIUrl":"10.1177/01466216251358491","url":null,"abstract":"<p><p>This paper makes three contributions to our understanding of measurement bias and predictive bias in testing. First, we develop a linear model for assessing measurement bias across two tests and two groups in terms of the estimated true-score relationships between the two tests in the two groups. This new model for measurement bias is structurally similar to the Cleary model for predictive bias, but it relies on the Errors-in-Variables (EIV) regression model, rather than the Ordinary-Least-Squares (OLS) regression model. Second, we examine some differences between measurement bias and predictive bias in three cases in which two groups have different true-score means, and we illustrate how regression toward the mean in OLS regression can lead to questionable conclusions about test bias if the differences between measurement bias and predictive bias are ignored. Third, we reevaluate a body of empirical findings suggesting that the tests employed in college-admissions and employment-testing programs tend to over-predict criterion performance for minorities, and we show that these findings are consistent with the occurrence of substantial measurement bias against the minority group relative to the majority group.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251358491"},"PeriodicalIF":1.0,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12234520/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144601949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Standard Error Estimation for Subpopulation Non-invariance
Pub Date: 2025-07-05 | DOI: 10.1177/01466216251351947
Paul A Jewsbury
Score linking is widely used to place scores from different assessments, or from the same assessment administered under different conditions, onto a common scale. A central concern is whether the linking function is invariant across subpopulations, as violations may threaten fairness. However, evaluating subpopulation differences in linked scores is challenging because linking error is not independent of sampling and measurement error when the same data are used to estimate the linking function and to compare score distributions. We show that common approaches that neglect linking error, or that treat it as independent, substantially overestimate the standard errors of subpopulation differences. We introduce new methods that account for linking error dependencies. Simulation results demonstrate the accuracy of the proposed methods, and a practical example with real data illustrates how improved standard error estimation enhances power for detecting subpopulation non-invariance.
{"title":"Standard Error Estimation for Subpopulation Non-invariance.","authors":"Paul A Jewsbury","doi":"10.1177/01466216251351947","DOIUrl":"10.1177/01466216251351947","url":null,"abstract":"<p><p>Score linking is widely used to place scores from different assessments, or the same assessment under different conditions, onto a common scale. A central concern is whether the linking function is invariant across subpopulations, as violations may threaten fairness. However, evaluating subpopulation differences in linked scores is challenging because linking error is not independent of sampling and measurement error when the same data are used to estimate the linking function and to compare score distributions. We show that common approaches involving neglecting linking error or treating it as independent substantially overestimate the standard errors of subpopulation differences. We introduce new methods that account for linking error dependencies. Simulation results demonstrate the accuracy of the proposed methods, and a practical example with real data illustrates how improved standard error estimation enhances power for detecting subpopulation non-invariance.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251351947"},"PeriodicalIF":1.0,"publicationDate":"2025-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12228644/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144585323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting DIF with the Multi-Unidimensional Pairwise Preference Model: Lord's Chi-square and IPR-NCDIF Methods
Pub Date: 2025-07-01 | DOI: 10.1177/01466216251351949
Lavanya S Kumar, Naidan Tu, Sean Joo, Stephen Stark
Multidimensional forced choice (MFC) measures are gaining prominence in noncognitive assessment, yet there has been little research on detecting differential item functioning (DIF) with models for forced choice measures. This research extended two well-known DIF detection methods to MFC measures. Specifically, the performance of Lord's chi-square and item parameter replication (IPR) methods was investigated for MFC tests based on the Multi-Unidimensional Pairwise Preference (MUPP) model. The Type I error rate and power of the DIF detection methods were examined in a Monte Carlo simulation that manipulated sample size, impact, DIF source, and DIF magnitude. Both methods showed consistent power and controlled the Type I error rate well across study conditions, indicating that established approaches to DIF detection work well with the MUPP model. Lord's chi-square outperformed the IPR method when the DIF source was statement discrimination, while the opposite was true when the DIF source was statement threshold. Both methods performed similarly and showed better power when the DIF source was statement location, in line with previous research. Study implications and practical recommendations for DIF detection with MFC tests, as well as limitations, are discussed.
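Lord's chi-square is a Wald-type statistic that compares an item's parameter estimates across reference and focal groups using their estimated covariance matrices; a generic sketch follows. The parameter labels (discrimination, location, threshold) and the covariance matrices are hypothetical, and nothing MUPP-specific is implemented here.

```python
import numpy as np
from scipy.stats import chi2

def lords_chi_square(params_ref, params_foc, cov_ref, cov_foc):
    """Lord's Wald-type chi-square for DIF: compares an item's parameter
    estimates across two groups, pooling their covariance matrices."""
    diff = np.asarray(params_ref) - np.asarray(params_foc)
    pooled = np.asarray(cov_ref) + np.asarray(cov_foc)
    stat = float(diff @ np.linalg.inv(pooled) @ diff)
    p_value = chi2.sf(stat, df=len(diff))
    return stat, p_value

# Hypothetical estimates for one statement (discrimination, location, threshold)
# in the reference and focal groups, with toy covariance matrices.
stat, p = lords_chi_square(
    params_ref=[1.10, 0.20, -0.50],
    params_foc=[1.35, 0.45, -0.45],
    cov_ref=np.diag([0.02, 0.01, 0.015]),
    cov_foc=np.diag([0.025, 0.012, 0.018]),
)
print(round(stat, 2), round(p, 4))
```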
{"title":"Detecting DIF with the Multi-Unidimensional Pairwise Preference Model: Lord's Chi-square and IPR-NCDIF Methods.","authors":"Lavanya S Kumar, Naidan Tu, Sean Joo, Stephen Stark","doi":"10.1177/01466216251351949","DOIUrl":"10.1177/01466216251351949","url":null,"abstract":"<p><p>Multidimensional forced choice (MFC) measures are gaining prominence in noncognitive assessment. Yet there has been little research on detecting differential item functioning (DIF) with models for forced choice measures. This research extended two well-known DIF detection methods to MFC measures. Specifically, the performance of Lord's chi-square and item parameter replication (IPR) methods with MFC tests based on the Multi-Unidimensional Pairwise Preference (MUPP) model was investigated. The Type I error rate and power of the DIF detection methods were examined in a Monte Carlo simulation that manipulated sample size, impact, DIF source, and DIF magnitude. Both methods showed consistent power and were found to control Type I error well across study conditions, indicating that established approaches to DIF detection work well with the MUPP model. Lord's chi-square outperformed the IPR method when DIF source was statement discrimination while the opposite was true when DIF source was statement threshold. Also, both methods performed similarly and showed better power when DIF source was statement location, in line with previous research. Study implications and practical recommendations for DIF detection with MFC tests, as well as limitations, are discussed.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251351949"},"PeriodicalIF":1.0,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12213542/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144561576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}