Idiographic interrater reliability measures for intensive longitudinal multirater data.
Tobias Koch, Miriam F Jaehne, Michaela Riediger, Antje Rauers, Jana Holtmann
DOI: 10.1111/bmsp.70022 | Published: 2025-12-20
Interrater reliability plays a crucial role in various areas of psychology. In this article, we propose a multilevel latent time series model for intensive longitudinal data with structurally different raters (e.g., self-reports and partner reports). The new MR-MLTS model enables researchers to estimate idiographic (person-specific) rater consistency coefficients for contemporaneous or dynamic rater agreement. Additionally, the model allows rater consistency coefficients to be linked to external explanatory or outcome variables. It can be implemented in Mplus as well as in the newly developed R package mlts. We illustrate the model using data from an intensive longitudinal multirater study involving 100 heterosexual couples (200 individuals) assessed across 86 time points. Our findings show that relationship duration and partner cognitive resources positively predict rater consistency for the innovations. Results from a simulation study indicate that the number of time points is critical for accurately estimating idiographic rater consistency coefficients, whereas the number of participants is important for accurately recovering the random effect variances. We discuss advantages, limitations, and future extensions of the MR-MLTS model.
Bayesian model averaging of (a)symmetric item response models in small samples.
Fabio Setti, Leah Feuerstahler
DOI: 10.1111/bmsp.70024 | Published: 2025-12-16
Asymmetric IRT models present theoretically desirable features, but often require large sample sizes for stable estimation due to additional item parameters. When applying item response theory (IRT) to small samples, it is often the case that only models with relatively few item parameters can be reliably estimated. Two recently developed asymmetric IRT models, the negative log-log and the complementary log-log, allow for different item response function (IRF) shapes than conventional IRT models and can be fit with small samples. In this paper, we propose Bayesian model averaging (BMA) of simple symmetric and asymmetric IRT models to explore item asymmetry and to flexibly estimate IRFs in small samples. We also consider model averaging at both the item level and the test level. We first show the feasibility of the approach with an empirical example. Then, in a simulation study involving complex data-generating conditions and small sample sizes (i.e., 100 and 250), we show that averaging methods recover asymmetry in the data-generating process and consistently outperform model selection and kernel smoothing. The methods proposed in this study are a practical alternative to more complex asymmetric IRT models and may also be useful in exploratory semi-parametric IRT analysis.
Differential item functioning detection across multiple groups.
Michela Battauz
DOI: 10.1111/bmsp.70023 | Published: 2025-12-16
Differential item functioning (DIF) can be investigated by estimating item response theory (IRT) parameters separately for different respondent groups, thus allowing for the detection of discrepancies in parameter estimates across groups. However, before comparing the estimates, it is necessary to convert them to a common metric because of the constraints required to identify the model. These processes influence each other, as the presence of DIF items affects the estimation of the scale conversion. This paper proposes a novel method that performs scale conversion and DIF detection simultaneously. By doing so, the estimated scale conversion automatically takes the presence of DIF into account. Differences in the item parameter estimates across groups can be explained by variables at the within-group item level or by the group itself. Penalized likelihood estimation is used to automatically select the item parameters that differ across groups. Real-data applications and simulation studies demonstrate the good performance of the proposed method.
Identifiability conditions in cognitive diagnosis: Implications for Q-matrix estimation algorithms.
Hyunjoo Kim, Hans Friedrich Köhn, Chia-Yi Chiu
DOI: 10.1111/bmsp.70020 | Published: 2025-12-12
The Q-matrix of a cognitively diagnostic assessment (CDA), documenting the item-attribute associations, is a key component of any CDA. However, the true Q-matrix underlying a CDA is never known and must be estimated, typically by content experts. Because human judgment is fallible, misspecifications of the Q-matrix may occur, resulting in the misclassification of examinees. In response to this challenge, algorithms have been developed to estimate the Q-matrix from item responses. Some of these algorithms impose identifiability conditions while others do not. The debate about which approach is "right" is ongoing, especially since these conditions are sufficient but not necessary, which means that viable alternative Q-matrix estimates may be ignored. In this study, the performance of Q-matrix estimation algorithms that impose identifiability conditions on the Q-matrix estimate was compared with that of algorithms that do not impose such conditions. Large-scale simulations examined the impact of factors such as sample size, test length, number of attributes, and error levels. The estimated Q-matrices were evaluated for meeting identifiability conditions and for their accuracy in classifying examinees. The simulation results showed that, for the various estimation algorithms studied here, imposing identifiability conditions on Q-matrix estimation did not change outcomes with respect to identifiability or examinee classification.
A multilevel Ornstein-Uhlenbeck process with individual- and variable-specific estimates as random effects.
José Ángel Martínez-Huertas, Emilio Ferrer
DOI: 10.1111/bmsp.70019 | Published: 2025-12-08
In the present study, we extend a stochastic differential equation (SDE) model, the Ornstein-Uhlenbeck (OU) process, to the simultaneous analysis of time series of multiple variables by means of random effects for individuals and variables within a Bayesian framework. This SDE model is a stationary Gauss-Markov process that varies over time around its mean. Our extension allows us to estimate the variability of different parameters of the process, such as the mean (μ) or the drift parameter (φ), across individuals and variables of the system by means of marginalized posterior distributions. We illustrate the estimation and interpretability of the parameters of this multilevel OU process in an empirical study of affect dynamics in which multiple individuals were measured on different variables at multiple time points. We also conducted a simulation study to evaluate whether the model can recover the population parameters generating the OU process. Our results support the use of this model to obtain both the general parameters (common to all individuals and variables) and the variable-specific point estimates (random effects). We conclude that this multilevel OU process with individual- and variable-specific estimates as random effects can be a useful approach for analysing time series of multiple variables simultaneously.
From tetrachoric to kappa: How to assess reliability on binary scales.
Sophie Vanbelle
DOI: 10.1111/bmsp.70021 | Published: 2025-12-08
Reliability is crucial in psychometrics, reflecting the extent to which a measurement instrument can discriminate between individuals or items. While classical test theory and intraclass correlation coefficients are well established for quantitative scales, estimating reliability for binary outcomes presents unique challenges due to their discrete nature. This paper reviews and links three major approaches for estimating the reliability of single ratings on binary scales: the normal approximation approach, kappa coefficients, and the latent variable approach, which enables estimation at both the latent and manifest scale levels. We clarify their conceptual relationships, establish conditions for asymptotic equivalence, and evaluate their performance across two common study designs: repeatability and reproducibility studies. We then extend the Bayesian Dirichlet-multinomial method for estimating kappa coefficients to settings with more than two replicates, without requiring Bayesian software. Additionally, we introduce a Bayesian method for estimating manifest scale reliability from latent scale reliability that can be implemented in standard Bayesian software. A simulation study compares the statistical properties of the three major approaches across Bayesian and frequentist frameworks. Overall, the normal approximation approach performed poorly, and the frequentist approach was unreliable due to singularity issues. The findings lead to refined practical recommendations.
Sample size determination for hypothesis testing on the intraclass correlation coefficient in a two-way analysis of variance model.
Dipro Mondal, Alberto Cassese, Math J J M Candel, Sophie Vanbelle
DOI: 10.1111/bmsp.70016 | Published: 2025-11-14
Reliability evaluation is critical in fields such as psychology and medicine to ensure accurate diagnosis and effective treatment management. When participants are evaluated by the same raters, a two-way ANOVA model is suitable for modelling the data, with the intraclass correlation coefficient (ICC) serving as the reliability metric. In these domains, the ICC for agreement (ICCa) is commonly used, as the values of the measurements themselves are of interest. Designing such reliability studies requires determining the numbers of participants and raters needed for the ICCa. Although procedures for sample size determination exist that are based on the expected width of the confidence interval for the ICCa, there is limited work on hypothesis testing. This paper addresses this gap by proposing procedures that ensure sufficient power to test statistically whether the ICCa exceeds a predetermined value, utilizing confidence intervals for the ICCa. We compared the available confidence interval methods for the ICCa and proposed sample size procedures based on the lower confidence limit of the best-performing methods. These procedures were evaluated in terms of the empirical power of the hypothesis test under various parameter configurations. Furthermore, the procedures are implemented in an interactive R Shiny app, freely available to researchers for determining sample sizes.
Generalized extreme value IRT models.
Jessica Alves, Jorge Bazán, Jorge González
DOI: 10.1111/bmsp.70015 | Published: 2025-11-12
This paper introduces two new item response theory (IRT) models based on the generalized extreme value (GEV) distribution. These models have asymmetric item characteristic curves (ICCs), which have drawn growing interest as they may better capture actual item response behaviour in specific scenarios. The analysis of the models is carried out using a Bayesian approach, and their properties are examined and discussed. The validity of the models is verified through extensive simulation studies evaluating the sensitivity of the models to the choice of prior for the newly introduced item parameter, the accuracy of parameter recovery, and the capacity of model comparison criteria to select the best model against other IRT models. The new models are exemplified using real data from two mathematics tests, one administered in Peruvian public schools and another administered to incoming university students in Chile. In both cases, the proposed models proved to be a promising alternative to existing asymmetric IRT models, offering new insights into item response modelling.
Reliability measures in knowledge structure theory.
Debora de Chiusole, Andrea Spoto, Umberto Granziol, Luca Stefanutti
DOI: 10.1111/bmsp.70013 | Published: 2025-11-01
Within the knowledge structure theory (KST) framework, this study evaluates the reliability of knowledge state estimation by introducing two key measures: the expected accuracy rate and the expected discrepancy. The accuracy rate quantifies the likelihood that the estimated knowledge state aligns with the true state, while the expected discrepancy measures the average deviation when misclassification occurs. To support the theoretical framework, we provide an in-depth discussion of these indices, supplemented by two simulation studies and an empirical example. The simulation results reveal a trade-off between the number of items and the size of the knowledge structure. Specifically, smaller structures exhibit consistent accuracy across different error levels, while larger structures show increasing discrepancies as error rates rise. Nevertheless, accuracy improves with a greater number of items in larger structures, mitigating the impact of errors. Additionally, the expected discrepancy analysis shows that, when misclassification occurs, the estimated state is generally close to the true one, minimizing the effect of errors on the assessment. Finally, an empirical application using real assessment data demonstrates the practical relevance of the proposed measures. This suggests that KST-based assessments provide reliable and meaningful diagnostic information, highlighting their potential for use in educational and psychological testing.