Semi-Parametric Item Response Theory With O'Sullivan Splines for Item Responses and Response Time
Pub Date: 2025-07-01 | Epub Date: 2025-02-02 | DOI: 10.1177/01466216251316277
Applied Psychological Measurement, pp. 224-242. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11789044/pdf/
Chen-Wei Liu
Response time (RT) has become an essential supplementary source of information for improving the estimation accuracy of latent traits and item parameters in educational testing. Most item response theory (IRT) approaches are based on parametric RT models. However, because test takers may alter their behavior during a test due to motivation or strategy shifts, fatigue, or other causes, parametric IRT models are unlikely to capture such subtle and nonlinear information. In this work, we propose a novel semi-parametric IRT model with O'Sullivan splines to accommodate flexible mean RT shapes and to explore the underlying nonlinear relationships between latent traits and RT. A simulation study was conducted to demonstrate the substantial improvement in parameter estimation achieved by the new model, as well as the biases and measurement errors that result from using parametric models. Using this model, a dataset of mathematics test scores and RT from the Programme for International Student Assessment was analyzed to demonstrate the evident nonlinearity and to compare the proposed model with existing models in terms of model fit. The findings indicate that the new approach is promising as an additional psychometric tool for enhancing test reliability and reducing measurement errors.
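As a rough illustration of the modeling idea only (not the article's exact specification; all symbols below are ours), a lognormal RT model with a fixed linear speed term can be generalized by letting the mean of log RT be a smooth spline function of the latent variable:

$$\log T_{ij} = \sum_{k=1}^{K} \gamma_{ik}\,B_k(\theta_j) + \varepsilon_{ij}, \qquad \varepsilon_{ij} \sim N(0,\sigma_i^2),$$

where the $B_k$ are O'Sullivan (penalized B-spline) basis functions for item $i$ and person $j$, and a roughness penalty such as $\lambda \int f''(\theta)^2\, d\theta$ controls how wiggly the estimated mean RT curve is allowed to be.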
{"title":"Semi-Parametric Item Response Theory With O'Sullivan Splines for Item Responses and Response Time.","authors":"Chen-Wei Liu","doi":"10.1177/01466216251316277","DOIUrl":"10.1177/01466216251316277","url":null,"abstract":"<p><p>Response time (RT) has been an essential resource for supplementing the estimation accuracy of latent traits and item parameters in educational testing. Most item response theory (IRT) approaches are based on parametric RT models. However, since test takers may alter their behaviors during a test due to motivation or strategy shifts, fatigue, or other causes, parametric IRT models are unlikely to capture such subtle and nonlinear information. In this work, we propose a novel semi-parametric IRT model with O'Sullivan splines to accommodate the flexible mean RT shapes and explore the underlying nonlinear relationships between latent traits and RT. A simulation study was conducted to demonstrate the substantial improvement in parameter estimation achieved by the new model, as well as the detriment of using parametric models in terms of biases and measurement errors. Using this model, a dataset of mathematics test scores and RT from the Programme for International Student Assessment was analyzed to demonstrate the evident nonlinearity and to compare the proposed model with existing models in terms of model fitting. The findings presented in this study indicate the promising nature of the new approach, suggesting its potential as an additional psychometric tool to enhance test reliability and reduce measurement errors.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"224-242"},"PeriodicalIF":1.2,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11789044/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143190883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compound Optimal Design for Online Item Calibration Under the Two-Parameter Logistic Model
Pub Date: 2025-07-01 | Epub Date: 2025-01-28 | DOI: 10.1177/01466216251316276
Applied Psychological Measurement, pp. 177-198. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11775943/pdf/
Lihong Song, Wenyi Wang
Under the theory of sequential design, a compound optimal design with two optimality criteria can be used to solve the problem of efficiently calibrating the item parameters of an item response theory model. To calibrate item parameters efficiently in computerized testing, a compound optimal design is proposed for the simultaneous estimation of item difficulty and discrimination parameters under the two-parameter logistic model; the design adaptively focuses on optimizing whichever parameter is more difficult to estimate. Using an acceptance probability, the compound optimal design can provide ability design points that optimize the item difficulty and discrimination parameters, respectively. Simulation and real-data analyses showed that the compound optimal design outperformed the D-optimal and random designs in terms of the recovery of both discrimination and difficulty parameters.
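For orientation, in generic textbook form (not the authors' exact criterion): under the 2PL with $P(\theta)=\{1+\exp[-a(\theta-b)]\}^{-1}$, a single response observed at ability $\theta$ carries the item-parameter Fisher information

$$I(a,b;\theta) = P(\theta)\,[1-P(\theta)] \begin{pmatrix} (\theta-b)^2 & -a(\theta-b) \\ -a(\theta-b) & a^2 \end{pmatrix},$$

and a compound design criterion combines two single-parameter criteria, e.g. $\Phi_\alpha(\xi)=\alpha\,\Phi_b(\xi)+(1-\alpha)\,\Phi_a(\xi)$, with the weight $\alpha$ shifted toward whichever parameter is currently harder to estimate; the acceptance-probability mechanism used to do this adaptively is described in the article itself.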
{"title":"Compound Optimal Design for Online Item Calibration Under the Two-Parameter Logistic Model.","authors":"Lihong Song, Wenyi Wang","doi":"10.1177/01466216251316276","DOIUrl":"10.1177/01466216251316276","url":null,"abstract":"<p><p>Under the theory of sequential design, compound optimal design with two optimality criteria can be used to solve the problem of efficient calibration of item parameters of item response theory model. In order to efficiently calibrate item parameters in computerized testing, a compound optimal design is proposed for the simultaneous estimation of item difficulty and discrimination parameters under the two-parameter logistic model, which adaptively focuses on optimizing the parameter which is difficult to estimate. The compound optimal design using the acceptance probability can provide ability design points to optimize the item difficulty and discrimination parameters, respectively. Simulation and real data analysis studies showed that the compound optimal design outperformed than the D-optimal and random design in terms of the recovery of both discrimination and difficulty parameters.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"177-198"},"PeriodicalIF":1.2,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11775943/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143068983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to Make Sense of Reliability? Common Language Interpretation of Reliability and the Relation of Reliability to Effect Size
Pub Date: 2025-06-24 | DOI: 10.1177/01466216251350159
Applied Psychological Measurement. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12187714/pdf/
Jari Metsämuuronen, Timi Niemensivu
Communicating the factual meaning of a particular reliability estimate is sometimes difficult. What does a specific reliability estimate of 0.80 or 0.95 mean in common language? Deflation-corrected estimates of reliability (DCER) using Somers' D or Goodman-Kruskal G as the item-score correlations are transformed into forms where specific estimates from the family of common language effect sizes are visible. This makes it possible to communicate reliability estimates using a common language and to evaluate the magnitude of a particular reliability estimate in the same way and with the same metric as we do with effect size estimates. Using a DCER, we can say that with k = 40 items, if the reliability is 0.95, in 80 out of 100 random pairs of test takers from different subpopulations on all items combined, those with a higher item response will also score higher on the test. In this case, using the thresholds familiar from effect sizes, we can say that the reliability is "very high." The transformation of the reliability estimate into a common language effect size depends on the size of the item-score association estimates and the number of items, so no closed-form equations for the transformations are given. However, relevant thresholds are provided for practical use.
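The "x out of 100 random pairs" reading can be made concrete with a small pair-counting sketch (illustrative only; the function name and tie handling are ours, and the article's deflation-corrected estimators and thresholds are not reproduced here):

```python
import numpy as np

def common_language_concordance(item, total):
    """Illustrative sketch: proportion of pairs of test takers who differ on
    the item in which the one with the higher item response also has the
    higher total test score. Ties on the total score count as half."""
    item, total = np.asarray(item), np.asarray(total)
    concordant = discordant = tied = 0
    n = len(item)
    for i in range(n):
        for j in range(i + 1, n):
            if item[i] == item[j]:
                continue  # pairs equal on the item are uninformative here
            d = (item[i] - item[j]) * (total[i] - total[j])
            if d > 0:
                concordant += 1
            elif d < 0:
                discordant += 1
            else:
                tied += 1
    pairs = concordant + discordant + tied
    return (concordant + 0.5 * tied) / pairs if pairs else float("nan")
```

A value of 0.80 for this kind of quantity corresponds to the statement "in 80 out of 100 such pairs, the test taker with the higher item response also scores higher on the test"; it is closely related to Somers' D computed with the item as the independent variable, which is why item-score associations of that type lend themselves to a common-language reading.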
{"title":"How to Make Sense of Reliability? Common Language Interpretation of Reliability and the Relation of Reliability to Effect Size.","authors":"Jari Metsämuuronen, Timi Niemensivu","doi":"10.1177/01466216251350159","DOIUrl":"10.1177/01466216251350159","url":null,"abstract":"<p><p>Communicating the factual meaning of a particular reliability estimate is sometimes difficult. What does a specific reliability estimate of 0.80 or 0.95 mean in common language? Deflation-corrected estimates of reliability (DCER) using Somers' <i>D</i> or Goodman-Kruskal <i>G</i> as the item-score correlations are transformed into forms where specific estimates from the family of common language effect sizes are visible. This makes it possible to communicate reliability estimates using a common language and to evaluate the magnitude of a particular reliability estimate in the same way and with the same metric as we do with effect size estimates. Using a DCER, we can say that with <i>k</i> = 40 items, if the reliability is 0.95, in 80 out of 100 random pairs of test takers from different subpopulations on all items combined, those with a higher item response will also score higher on the test. In this case, using the thresholds familiar from effect sizes, we can say that the reliability is \"very high.\" The transformation of the reliability estimate into a common language effect size depends on the size of the item-score association estimates and the number of items, so no closed-form equations for the transformations are given. However, relevant thresholds are provided for practical use.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251350159"},"PeriodicalIF":1.0,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12187714/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144508891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Increase of Uncertainty in Summed-Score-Based Scoring in Non-Rasch IRT
Pub Date: 2025-06-12 | DOI: 10.1177/01466216251350342
Applied Psychological Measurement. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12162545/pdf/
Eisuke Segawa
Summed-score (SS)-based scoring in non-Rasch IRT allows for pencil-and-paper administration and is used in the Patient-Reported Outcomes Measurement Information System (PROMIS) alongside response-pattern (RP)-based scoring. However, this convenience comes with an increase in uncertainty (the increase) associated with SS scoring. The increase can be quantified through the relationship between Bayesian SS and RP scoring. Given an SS of s, the SS posterior is a weighted sum of RP posteriors, with weights representing the marginal probabilities of the RPs. From this mixture, the SS score (the SS posterior mean) is a weighted sum of RP posterior means, and its uncertainty (the variance of the SS posterior) decomposes into the uncertainty of RP scoring (the weighted sum of RP posterior variances) and the increase (the variance of the RP posterior means). Without quantifying the increase, PROMIS recommends RP scoring for greater accuracy, suggesting SS scoring as a second option. Using this variance decomposition, we quantified the increase for two short forms (SFs). In one, the increase is very small, making SS scoring as accurate as RP scoring; in the other, the increase is large, indicating that SS scoring may not be a viable second option. The increase varies widely, influences scoring decisions, and should be reported for each SF when SS scoring is used.
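The decomposition described above is the law of total variance applied to the summed-score posterior; in generic notation (ours, not necessarily the article's),

$$p(\theta\mid s)=\sum_{r\,:\,\mathrm{SS}(r)=s} w_r\,p(\theta\mid r), \qquad E[\theta\mid s]=\sum_r w_r\,E[\theta\mid r],$$

$$\mathrm{Var}[\theta\mid s]=\underbrace{\sum_r w_r\,\mathrm{Var}[\theta\mid r]}_{\text{RP uncertainty}}+\underbrace{\sum_r w_r\bigl(E[\theta\mid r]-E[\theta\mid s]\bigr)^2}_{\text{the increase}},$$

where the sums run over the response patterns $r$ whose summed score equals $s$ and $w_r$ is the marginal probability of pattern $r$ among those patterns.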
{"title":"Increase of Uncertainty in Summed-Score-Based Scoring in Non-Rasch IRT.","authors":"Eisuke Segawa","doi":"10.1177/01466216251350342","DOIUrl":"10.1177/01466216251350342","url":null,"abstract":"<p><p>Summed-score (SS)-based scoring in non-Rasch IRT allows for pencil-and-paper administration and is used in the Patient-Reported Outcomes Measurement Information System (PROMIS) alongside response-pattern-based scoring. However, this convenience comes with an increase in uncertainty (the increase) associated with SS scoring. The increase can be quantified through the relationship between Bayesian SS and RP scoring. Given an SS of s, the SS posterior is a weighted sum of RP posteriors, with weights representing the marginal probabilities of RPs. From this mixture, the SS score (SS posterior mean) is a weighted sum of RP posterior means, and its uncertainty (variance of the SS posterior) is decomposed into the uncertainty of RP scoring (the weighted sum of RP posterior variances) and the increase (variance of RP posterior means). Without quantifying the increase, PROMIS recommends RP scoring for greater accuracy, suggesting SS scoring as a second option. Using variance decomposition, we quantified the increases for two short forms (SFs). In one, the increase is very small, making SS scoring as accurate as RP scoring, while in the other, the increase is large, indicating SS scoring may not be a viable second option. The increase varies widely, influencing scoring decisions, and should be reported for each SF when SS scoring is used.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251350342"},"PeriodicalIF":1.0,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12162545/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144303335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
tna: An R Package for Transition Network Analysis
Pub Date: 2025-06-05 | DOI: 10.1177/01466216251348840
Applied Psychological Measurement. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12141252/pdf/
Santtu Tikka, Sonsoles López-Pernas, Mohammed Saqr
Understanding the dynamics of transitions plays a central role in educational research, informing studies of learning processes, motivation shifts, and social interactions. Transition network analysis (TNA) is a unified framework of probabilistic modeling and network analysis for capturing the temporal and relational aspects of transitions between events or states of interest. We introduce the R package tna that implements procedures for estimating the TNA models, building the transition networks, identifying patterns and communities, computing centrality measures, and visualizing the networks. The package also implements several functions for statistical procedures that can be used to assess differences between groups, stability of centrality measures and importance of specific transitions.
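To illustrate the basic quantity a transition network is built on (a Python sketch for exposition, not the tna package's R API), a first-order transition probability matrix can be estimated from coded event sequences like this:

```python
import numpy as np

def transition_matrix(sequences, states):
    """Illustrative sketch (not the tna R API): row-stochastic matrix of
    first-order transition probabilities estimated from coded event sequences."""
    index = {s: i for i, s in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):          # consecutive event pairs
            counts[index[a], index[b]] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Example: two learners' coded activity sequences
sequences = [["plan", "read", "discuss", "read"], ["read", "discuss", "plan"]]
P = transition_matrix(sequences, ["plan", "read", "discuss"])
```

The tna package layers its own estimation, pattern and community detection, centrality, and group-comparison procedures on top of such transition structures, as summarized in the abstract above.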
{"title":"tna: An R Package for Transition Network Analysis.","authors":"Santtu Tikka, Sonsoles López-Pernas, Mohammed Saqr","doi":"10.1177/01466216251348840","DOIUrl":"10.1177/01466216251348840","url":null,"abstract":"<p><p>Understanding the dynamics of transitions plays a central role in educational research, informing studies of learning processes, motivation shifts, and social interactions. Transition network analysis (TNA) is a unified framework of probabilistic modeling and network analysis for capturing the temporal and relational aspects of transitions between events or states of interest. We introduce the R package tna that implements procedures for estimating the TNA models, building the transition networks, identifying patterns and communities, computing centrality measures, and visualizing the networks. The package also implements several functions for statistical procedures that can be used to assess differences between groups, stability of centrality measures and importance of specific transitions.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251348840"},"PeriodicalIF":1.0,"publicationDate":"2025-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12141252/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144250343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Use of Elbow Plot Method for Class Enumeration in Factor Mixture Models
Pub Date: 2025-05-20 | DOI: 10.1177/01466216251344288
Applied Psychological Measurement. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092417/pdf/
Sedat Sen, Allan S Cohen
Application of factor mixture models (FMMs) requires determining the correct number of latent classes. A number of studies have examined the performance of several information criterion (IC) indices, but as yet none have studied the effectiveness of the elbow plot method. In this study, therefore, the effectiveness of the elbow plot method was compared with the lowest value criterion and the difference method calculated from five commonly used IC indices. Results of a simulation study showed the elbow plot method to detect the generating model at least 90% of the time for two- and three-class FMMs. Results also showed the elbow plot method did not perform well for two-factor and four-class conditions. The performance of the elbow plot method was generally better than that of the lowest IC value criterion and difference method under two- and three-class conditions. For the four-latent class conditions, there were no meaningful differences between the results of the elbow plot method and the lowest value criterion method. On the other hand, the difference method outperformed the other two methods in conditions with two factors and four classes.
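One simple way to formalize the visual elbow judgment (our own operationalization for illustration, not necessarily the rule applied in the study) is to pick the class number at which the IC curve bends most sharply, i.e., the largest second difference:

```python
import numpy as np

def elbow_classes(ic_values):
    """Illustrative elbow rule: ic_values[k-1] is the information criterion
    for a k-class model (lower = better); returns the class number at the
    sharpest bend of the curve."""
    ic = np.asarray(ic_values, dtype=float)
    if ic.size < 3:
        return int(np.argmin(ic)) + 1
    curvature = ic[:-2] - 2 * ic[1:-1] + ic[2:]   # second differences at k = 2..K-1
    return int(np.argmax(curvature)) + 2

print(elbow_classes([10500, 9800, 9700, 9690, 9685]))  # BIC for 1-5 classes -> 2
```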
{"title":"On the Use of Elbow Plot Method for Class Enumeration in Factor Mixture Models.","authors":"Sedat Sen, Allan S Cohen","doi":"10.1177/01466216251344288","DOIUrl":"10.1177/01466216251344288","url":null,"abstract":"<p><p>Application of factor mixture models (FMMs) requires determining the correct number of latent classes. A number of studies have examined the performance of several information criterion (IC) indices, but as yet none have studied the effectiveness of the elbow plot method. In this study, therefore, the effectiveness of the elbow plot method was compared with the lowest value criterion and the difference method calculated from five commonly used IC indices. Results of a simulation study showed the elbow plot method to detect the generating model at least 90% of the time for two- and three-class FMMs. Results also showed the elbow plot method did not perform well for two-factor and four-class conditions. The performance of the elbow plot method was generally better than that of the lowest IC value criterion and difference method under two- and three-class conditions. For the four-latent class conditions, there were no meaningful differences between the results of the elbow plot method and the lowest value criterion method. On the other hand, the difference method outperformed the other two methods in conditions with two factors and four classes.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"01466216251344288"},"PeriodicalIF":1.0,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092417/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144129245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Generalized Multi-Detector Combination Approach for Differential Item Functioning Detection
Pub Date: 2025-05-01 | Epub Date: 2024-12-19 | DOI: 10.1177/01466216241310602
Applied Psychological Measurement, pp. 75-89. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11660104/pdf/
Shan Huang, Hidetoki Ishii
Many studies on differential item functioning (DIF) detection rely on single detection methods (SDMs), each of which necessitates specific assumptions that may not always be validated. Using an inappropriate SDM can lead to diminished accuracy in DIF detection. To address this limitation, a novel multi-detector combination (MDC) approach is proposed. Unlike SDMs, MDC effectively evaluates the relevance of different SDMs under various test conditions and integrates them using supervised learning, thereby mitigating the risk associated with selecting a suboptimal SDM for DIF detection. This study aimed to validate the accuracy of the MDC approach by applying five types of SDMs and four distinct supervised learning methods in MDC modeling. Model performance was assessed using the area under the curve (AUC), which provided a comprehensive measure of the ability of the model to distinguish between classes across all threshold levels, with higher AUC values indicating higher accuracy. The MDC methods consistently achieved higher average AUC values compared to SDMs in both matched test sets (where test conditions align with the training set) and unmatched test sets. Furthermore, MDC outperformed all SDMs under each test condition. These findings indicated that MDC is highly accurate and robust across diverse test conditions, establishing it as a viable method for practical DIF detection.
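The general workflow can be sketched as follows (a hypothetical setup: the five SDM statistics, the learner, and the simulated labels are placeholders, not the study's actual detectors or data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))        # one row per item: 5 SDM statistics (placeholder values)
y = rng.integers(0, 2, size=1000)     # 1 = DIF simulated for the item, 0 = no DIF (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
mdc = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, mdc.predict_proba(X_test)[:, 1])  # AUC of the combined DIF detector
```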
{"title":"A Generalized Multi-Detector Combination Approach for Differential Item Functioning Detection.","authors":"Shan Huang, Hidetoki Ishii","doi":"10.1177/01466216241310602","DOIUrl":"10.1177/01466216241310602","url":null,"abstract":"<p><p>Many studies on differential item functioning (DIF) detection rely on single detection methods (SDMs), each of which necessitates specific assumptions that may not always be validated. Using an inappropriate SDM can lead to diminished accuracy in DIF detection. To address this limitation, a novel multi-detector combination (MDC) approach is proposed. Unlike SDMs, MDC effectively evaluates the relevance of different SDMs under various test conditions and integrates them using supervised learning, thereby mitigating the risk associated with selecting a suboptimal SDM for DIF detection. This study aimed to validate the accuracy of the MDC approach by applying five types of SDMs and four distinct supervised learning methods in MDC modeling. Model performance was assessed using the area under the curve (AUC), which provided a comprehensive measure of the ability of the model to distinguish between classes across all threshold levels, with higher AUC values indicating higher accuracy. The MDC methods consistently achieved higher average AUC values compared to SDMs in both matched test sets (where test conditions align with the training set) and unmatched test sets. Furthermore, MDC outperformed all SDMs under each test condition. These findings indicated that MDC is highly accurate and robust across diverse test conditions, establishing it as a viable method for practical DIF detection.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"75-89"},"PeriodicalIF":1.2,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11660104/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142878074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inference of Correlations Among Testlet Effects: A Latent Variable Selection Method
Pub Date: 2025-05-01 | Epub Date: 2024-12-26 | DOI: 10.1177/01466216241310598
Applied Psychological Measurement, pp. 126-155. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11670239/pdf/
Xin Xu, Jinxin Guo, Tao Xin
In psychological and educational measurement, a testlet-based test is a common and popular format, especially in large-scale assessments. In modeling testlet effects, a standard bifactor model, as a common strategy, assumes that the testlet effects and the main effect are fully independent. However, it is difficult in practice to construct item clusters that satisfy this independence assumption. To address this issue, correlations among testlets can be taken into account when fitting the data. Moreover, one may wish to maintain a good practical interpretation of the sparse loading matrix. In this paper, we propose data-driven learning of the significant correlations in the covariance matrix through a latent variable selection method. Under the proposed method, a regularization is imposed on the weak correlations in the extended bifactor model. Further, a stochastic expectation-maximization algorithm is employed for efficient computation. Results from simulation studies show the consistency of the proposed method in selecting significant correlations. Empirical data from the 2015 Programme for International Student Assessment are analyzed with the proposed method as an example.
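In generic form (our notation, omitting the stochastic EM details), this kind of selection amounts to maximizing a penalized marginal log-likelihood in which weak testlet-factor correlations are shrunk to zero:

$$\hat{\boldsymbol\Sigma}=\arg\max_{\boldsymbol\Sigma}\;\ell(\boldsymbol\Sigma\mid \mathbf{Y})-\lambda\sum_{k<l}\lvert\sigma_{kl}\rvert,$$

where the $\sigma_{kl}$ are the correlations among testlet effects in the extended bifactor model and $\lambda$ governs how aggressively small correlations are set to zero; only the correlations that survive the penalty are retained and interpreted.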
{"title":"Inference of Correlations Among Testlet Effects: A Latent Variable Selection Method.","authors":"Xin Xu, Jinxin Guo, Tao Xin","doi":"10.1177/01466216241310598","DOIUrl":"10.1177/01466216241310598","url":null,"abstract":"<p><p>In psychological and educational measurement, a testlet-based test is a common and popular format, especially in some large-scale assessments. In modeling testlet effects, a standard bifactor model, as a common strategy, assumes different testlet effects and the main effect to be fully independently distributed. However, it is difficult to establish perfectly independent clusters as this assumption. To address this issue, correlations among testlets could be taken into account in fitting data. Moreover, one may desire to maintain a good practical interpretation of the sparse loading matrix. In this paper, we propose data-driven learning of significant correlations in the covariance matrix through a latent variable selection method. Under the proposed method, a regularization is performed on the weak correlations for the extended bifactor model. Further, a stochastic expectation maximization algorithm is employed for efficient computation. Results from simulation studies show the consistency of the proposed method in selecting significant correlations. Empirical data from the 2015 Program for International Student Assessment is analyzed using the proposed method as an example.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"126-155"},"PeriodicalIF":1.2,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11670239/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142903933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Measurement of Change in the Context of Item Parameter Drift
Pub Date: 2025-05-01 | Epub Date: 2024-12-30 | DOI: 10.1177/01466216241310599
Applied Psychological Measurement, pp. 109-125. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683792/pdf/
Allison W Cooperman, Ming Him Tai, Joseph N DeWeese, David J Weiss
Adaptive measurement of change (AMC) uses computerized adaptive testing (CAT) to measure and test the significance of intraindividual change on one or more latent traits. The extant AMC research has so far assumed that item parameter values are constant across testing occasions. Yet item parameters might change over time, a phenomenon termed item parameter drift (IPD). The current study examined AMC's performance in the context of IPD with unidimensional, dichotomous CATs across two testing occasions. A Monte Carlo simulation revealed that AMC false and true positive rates were primarily affected by changes in the difficulty parameter. False positive rates were related to the location of the drift items relative to the latent trait continuum, as the administration of more drift items spuriously increased the magnitude of estimated trait change. Moreover, true positive rates depended upon an interaction between the direction of difficulty parameter drift and the latent trait change trajectory. A follow-up simulation further showed that the number of items in the CAT with parameter drift impacted AMC false and true positive rates, with these relationships moderated by IPD characteristics and the latent trait change trajectory. It is recommended that test administrators confirm the absence of IPD prior to using AMC for measuring intraindividual change with educational and psychological tests.
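For context, a common AMC-type significance test of change between two CAT administrations (a standard form from the AMC literature, not necessarily the exact statistic used in this study) compares

$$Z=\frac{\hat\theta_2-\hat\theta_1}{\sqrt{SE(\hat\theta_1)^2+SE(\hat\theta_2)^2}}$$

against standard normal critical values. A false positive is a significant $Z$ when no true trait change was generated, so anything that biases $\hat\theta_2-\hat\theta_1$, such as drift in item difficulty at the second occasion, alters the rate of such flags.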
{"title":"Adaptive Measurement of Change in the Context of Item Parameter Drift.","authors":"Allison W Cooperman, Ming Him Tai, Joseph N DeWeese, David J Weiss","doi":"10.1177/01466216241310599","DOIUrl":"10.1177/01466216241310599","url":null,"abstract":"<p><p>Adaptive measurement of change (AMC) uses computerized adaptive testing (CAT) to measure and test the significance of intraindividual change on one or more latent traits. The extant AMC research has so far assumed that item parameter values are constant across testing occasions. Yet item parameters might change over time, a phenomenon termed item parameter drift (IPD). The current study examined AMC's performance in the context of IPD with unidimensional, dichotomous CATs across two testing occasions. A Monte Carlo simulation revealed that AMC false and true positive rates were primarily affected by changes in the difficulty parameter. False positive rates were related to the location of the drift items relative to the latent trait continuum, as the administration of more drift items spuriously increased the magnitude of estimated trait change. Moreover, true positive rates depended upon an interaction between the direction of difficulty parameter drift and the latent trait change trajectory. A follow-up simulation further showed that the number of items in the CAT with parameter drift impacted AMC false and true positive rates, with these relationships moderated by IPD characteristics and the latent trait change trajectory. It is recommended that test administrators confirm the absence of IPD prior to using AMC for measuring intraindividual change with educational and psychological tests.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"109-125"},"PeriodicalIF":1.2,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683792/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142915981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Information Manifold Perspective for Analyzing Test Data
Pub Date: 2025-05-01 | Epub Date: 2024-12-20 | DOI: 10.1177/01466216241310600
Applied Psychological Measurement, pp. 90-108. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11662344/pdf/
James O Ramsay, Juan Li, Joakim Wallmark, Marie Wiberg
Modifications of current psychometric models for analyzing test data are proposed that produce an additive scale measure of information. This information measure is a one-dimensional space curve or curved surface manifold that is invariant across varying manifold indexing systems. The arc length along a curve manifold is used as it is an additive metric having a defined zero and a version of the bit as a unit. This property, referred to here as the scope of the test or an item, facilitates the evaluation of graphs and numerical summaries. The measurement power of the test is defined by the length of the manifold, and the performance or experiential level of a person by a position along the curve. In this study, we also use all information from the items including the information from the distractors. Test data from a large-scale college admissions test are used to illustrate the test information manifold perspective and to compare it with the well-known item response theory nominal model. It is illustrated that the use of information theory opens a vista of new ways of assessing item performance and inter-item dependency, as well as test takers' knowledge.
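The additive scale can be pictured with the generic arc-length construction (our sketch; the specific coordinate system and indexing the authors use are developed in the article): for a curve $\mathbf{x}(t)$ traced out as the index $t$ varies,

$$s(t)=\int_{t_0}^{t}\left\lVert\frac{d\mathbf{x}(u)}{du}\right\rVert du,$$

which is additive over segments, has a defined zero at $t_0$, and is unchanged by smooth reparameterizations of the index; when the coordinates are surprisal values such as $-\log_2 P$, distances along the curve inherit a bit-like unit, consistent with the "scope" interpretation described above.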
{"title":"An Information Manifold Perspective for Analyzing Test Data.","authors":"James O Ramsay, Juan Li, Joakim Wallmark, Marie Wiberg","doi":"10.1177/01466216241310600","DOIUrl":"10.1177/01466216241310600","url":null,"abstract":"<p><p>Modifications of current psychometric models for analyzing test data are proposed that produce an additive scale measure of information. This information measure is a one-dimensional space curve or curved surface manifold that is invariant across varying manifold indexing systems. The arc length along a curve manifold is used as it is an additive metric having a defined zero and a version of the bit as a unit. This property, referred to here as the scope of the test or an item, facilitates the evaluation of graphs and numerical summaries. The measurement power of the test is defined by the length of the manifold, and the performance or experiential level of a person by a position along the curve. In this study, we also use all information from the items including the information from the distractors. Test data from a large-scale college admissions test are used to illustrate the test information manifold perspective and to compare it with the well-known item response theory nominal model. It is illustrated that the use of information theory opens a vista of new ways of assessing item performance and inter-item dependency, as well as test takers' knowledge.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":" ","pages":"90-108"},"PeriodicalIF":1.2,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11662344/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142878097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}