The growing popularity of the ecological momentary assessment method in psychological research requires adequate statistical models for intensive longitudinal data (ILD), and multilevel latent state-trait (ML-LST) models based on latent state-trait theory revised (LST-R theory) are one possible option. Besides the traditional LST-R coefficients of reliability, consistency, and occasion-specificity, ML-LST models are also suitable for estimating reliability at Level 1 ("within-subject reliability") and Level 2 ("between-subject reliability"). However, these level-specific coefficients have not yet been defined in LST-R theory, so their interpretation from the perspective of LST-R theory has been unclear. In the current study, we discuss the interpretation and identification of these coefficients based on (multilevel) versions of the Multistate-Singletrait (MSST) model, the Multistate-Indicator-specific-trait (MSIT) model, and the Multistate-Singletrait model with M-1 correlated method factors (MSST-M-1). We show that, in the MSST-M-1 model, the between-subject coefficient is a measure of the indicator-unspecificity of an item (i.e., the portion of between-level variance that a specific item shares with a common trait) or of the unidimensionality of a scale. Moreover, we highlight differences between occasion-specificity and within-subject reliability. The performance of the ML-MSST-M-1 model and the corresponding theoretical findings are illustrated using data from an experience sampling study on within-person fluctuations of narcissistic admiration (Heyde et al., 2023).
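For readers unfamiliar with LST-R notation, the following is a schematic sketch of the variance decomposition behind these coefficients, not the paper's exact multilevel definitions; the symbols ξ (trait), ζ (occasion-specific residual), and ε (measurement error) are generic choices rather than the authors' notation:

\[
Y_{it} = \xi_i + \zeta_{it} + \varepsilon_{it},
\qquad
\mathrm{Rel} = \frac{\mathrm{Var}(\xi_i) + \mathrm{Var}(\zeta_{it})}{\mathrm{Var}(Y_{it})},
\quad
\mathrm{Con} = \frac{\mathrm{Var}(\xi_i)}{\mathrm{Var}(Y_{it})},
\quad
\mathrm{OSpe} = \frac{\mathrm{Var}(\zeta_{it})}{\mathrm{Var}(Y_{it})}.
\]

In a multilevel formulation, the level-specific coefficients would then take the schematic form

\[
\mathrm{Rel}_{B} = \frac{\mathrm{Var}(\xi_i)}{\mathrm{Var}_{B}(Y)},
\qquad
\mathrm{Rel}_{W} = \frac{\mathrm{Var}(\zeta_{it})}{\mathrm{Var}(\zeta_{it}) + \mathrm{Var}(\varepsilon_{it})},
\]

where Var_B(Y) collects all between-person variance components (in the MSST-M-1 case, common-trait plus method-factor variance), which is what makes Rel_B interpretable as indicator-unspecificity.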
{"title":"Level-specific reliability coefficients from the perspective of latent state-trait theory.","authors":"Lennart Nacke, Axel Mayer","doi":"10.1111/bmsp.70027","DOIUrl":"https://doi.org/10.1111/bmsp.70027","url":null,"abstract":"<p><p>The growing popularity of the ecological momentary assessment method in psychological research requires adequate statistical models for intensive longitudinal data (ILD), with multilevel latent state-trait (ML-LST) models based on the latent state-trait theory revised (LST-R theory) as one possible alternative. Besides the traditional LST-R coefficients reliability, consistency and occasion-specificity, ML-LST models are also suitable for estimating reliability at Level 1 (\"within-subject reliability\") and Level 2 (\"between-subject reliability\"). However, these level-specific coefficients have not yet been defined in LST-R theory and, therefore, their interpretation has been unclear from the perspective of LST-R theory. In the current study, we discuss the interpretation and identification of these coefficients based on the (multilevel) versions of the Multistate-Singletrait (MSST), the Multistate-Indicator-specific trait (MSIT) and the Multistate-Singletrait model with M-1 correlated method factors (MSST-M-1). We show that, in the MSST-M-1 model, the between-subject coefficient is a measure of the indicator-unspecificity of an item (i.e. the portion of between-level variance that a specific item shares with a common trait) or the unidimensionality of a scale. Moreover, we highlight differences between occasion-specificity and within-subject reliability. The performance of the ML-MSST-M-1 model and the corresponding theoretical findings are illustrated using data from an experience sampling study on the within-person fluctuations of narcissistic admiration (Heyde et al., 2023).</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145844542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent variable models typically require large sample sizes for acceptable efficiency and reliable convergence, so appropriate informative priors are often required to gainfully employ Bayesian analysis with small samples. Power priors are informative priors built on historical data, weighted to account for non-exchangeability with the current sample. Many extant power prior approaches are designed for manifest variable models and are not easily adapted to latent variable models; for example, they may require integration over all model parameters. We examined two recent power prior approaches that are straightforward to adapt to these models: Mahalanobis weight (MW) priors, based on Golchi (Use of historical individual patient data in analysis of clinical trials, 2020), and univariate priors, based on Finch (The Psychiatrist, 6, 2024, 45)'s application of Haddad et al. (Journal of Biopharmaceutical Statistics, 27, 2017, 1089) and Balcome et al. (bayesdp: Implementation of the Bayesian discount prior approach for clinical trials, 2022). We applied these approaches, along with diffuse and weakly informative priors, to a latent variable mediation model under various sample sizes and non-exchangeability conditions. We compared their performance in terms of convergence, bias, efficiency, and credible interval coverage when estimating an indirect effect. Diffuse priors and the univariate approach led to poor convergence. The weakly informative and MW approaches both improved convergence and yielded reasonable estimates, but MW performed poorly under some non-exchangeability conditions. We discuss the issues with these approaches and future research directions.
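For orientation, the generic power prior (Ibrahim & Chen, 2000) discounts a historical likelihood by a weight a₀; the approaches compared here differ mainly in how that weight is chosen and applied:

\[
\pi(\theta \mid D_0, a_0) \propto L(\theta \mid D_0)^{a_0}\, \pi_0(\theta), \qquad a_0 \in [0, 1],
\]

where D₀ is the historical data set, π₀ an initial prior, a₀ = 0 discards the historical data entirely, and a₀ = 1 pools it fully with the current sample.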
{"title":"Power priors for latent variable mediation models under small sample sizes.","authors":"Lihan Chen, Milica Miočević, Carl F Falk","doi":"10.1111/bmsp.70025","DOIUrl":"https://doi.org/10.1111/bmsp.70025","url":null,"abstract":"<p><p>Latent variable models typically require large sample sizes for acceptable efficiency and reliable convergence. Appropriate informative priors are often required for gainfully employing Bayesian analysis with small samples. Power priors are informative priors built on historical data, weighted to account for non-exchangeability with the current sample. Many extant power prior approaches are designed for manifest variable models, and are not easily adapted for latent variable models, for example, they may require integration over all model parameters. We examined two recent power prior approaches straightforward to adapt to these models, Mahalanobis weight (MW) priors based on Golchi (Use of historical individual patient data in analysis of clinical trials, 2020), and univariate priors, based on Finch (The Psychiatrist, 6, 2024, 45)'s application of Haddad et al. (Journal of Biopharmaceutical Statistics, 27, 2017, 1089) and Balcome et al. (bayesdp: Implementation of the Bayesian discount prior approach for clinical trials, 2022). We applied these approaches along with diffuse and weakly informative priors to a latent variable mediation model, under various sample sizes and non-exchangeability conditions. We compared their performances in terms of convergence, bias, efficiency, and credible interval coverage when estimating an indirect effect. Diffuse priors and the univariate approach lead to poor convergence. The weakly informative and MW approach both improved convergence and yielded reasonable estimates, but MW performed poorly under some non-exchangeable conditions. We discussed the issues with these approaches and future research directions.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145822099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interrater reliability plays a crucial role in various areas of psychology. In this article, we propose a multilevel latent time series model for intensive longitudinal data with structurally different raters (e.g., self-reports and partner reports). The new MR-MLTS model enables researchers to estimate idiographic (person-specific) rater consistency coefficients for contemporaneous or dynamic rater agreement. Additionally, the model allows rater consistency coefficients to be linked to external explanatory or outcome variables. It can be implemented in Mplus as well as in the newly developed R package mlts. We illustrate the model using data from an intensive longitudinal multirater study involving 100 heterosexual couples (200 individuals) assessed across 86 time points. Our findings show that relationship duration and partner cognitive resources positively predict rater consistency for the innovations (the time-specific dynamic residuals). Results from a simulation study indicate that the number of time points is critical for accurately estimating idiographic rater consistency coefficients, whereas the number of participants is important for accurately recovering the random-effect variances. We discuss advantages, limitations, and future extensions of the MR-MLTS model.
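As a rough schematic, not the exact MR-MLTS specification: suppose a common latent state η_it follows a person-specific AR(1) process and each rater r provides a noisy report of it; a person-specific consistency coefficient could then be formed from the share of a rater's variance carried by the common state. All symbols below are illustrative assumptions:

\[
y_{rit} = \mu_{ri} + \lambda_{r}\,\eta_{it} + \varepsilon_{rit},
\qquad
\eta_{it} = \varphi_i\, \eta_{i,t-1} + \zeta_{it},
\qquad
\mathrm{Con}_{ri} = \frac{\lambda_r^2\,\mathrm{Var}(\eta_{it})}{\lambda_r^2\,\mathrm{Var}(\eta_{it}) + \mathrm{Var}(\varepsilon_{rit})}.
\]

Because φ_i and the variance components carry person subscripts, such consistency coefficients are idiographic and can themselves be regressed on external variables, which is the feature the abstract highlights.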
{"title":"Idiographic interrater reliability measures for intensive longitudinal multirater data.","authors":"Tobias Koch, Miriam F Jaehne, Michaela Riediger, Antje Rauers, Jana Holtmann","doi":"10.1111/bmsp.70022","DOIUrl":"https://doi.org/10.1111/bmsp.70022","url":null,"abstract":"<p><p>Interrater reliability plays a crucial role in various areas of psychology. In this article, we propose a multilevel latent time series model for intensive longitudinal data with structurally different raters (e.g., self-reports and partner reports). The new MR-MLTS model enables researchers to estimate idiographic (person-specific) rater consistency coefficients for contemporaneous or dynamic rater agreement. Additionally, the model allows rater consistency coefficients to be linked to external explanatory or outcome variables. It can be implemented in Mplus as well as in the newly developed R package mlts. We illustrate the model using data from an intensive longitudinal multirater study involving 100 heterosexual couples (200 individuals) assessed across 86 time points. Our findings show that relationship duration and partner cognitive resources positively predict rater consistency for the innovations. Results from a simulation study indicate that the number of time points is critical for accurately estimating idiographic rater consistency coefficients, whereas the number of participants is important for accurately recovering the random effect variances. We discuss advantages, limitations, and future extensions of the MR-MLTS model.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145795447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asymmetric item response theory (IRT) models present theoretically desirable features but often require large sample sizes for stable estimation due to their additional item parameters. When applying IRT to small samples, it is often the case that only models with relatively few item parameters can be reliably estimated. Two recently developed asymmetric IRT models, the negative log-log and the complementary log-log, allow for item response function (IRF) shapes different from those of conventional IRT models and can be fit with small samples. In this paper, we propose Bayesian model averaging (BMA) of simple symmetric and asymmetric IRT models to explore item asymmetry and to flexibly estimate IRFs in small samples. We also consider model averaging at both the item level and the test level. We first show the feasibility of the approach with an empirical example. Then, in a simulation study involving complex data-generating conditions and small sample sizes (i.e., 100 and 250), we show that averaging methods recover asymmetry in the data-generating process and consistently outperform model selection and kernel smoothing. The methods proposed in this study are a practical alternative to more complex asymmetric IRT models and may also be useful in exploratory semi-parametric IRT analysis.
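The two asymmetric links mentioned have standard closed forms; writing a_j and b_j for discrimination and difficulty (generic symbols, not necessarily the authors'), the negative log-log (NLL) and complementary log-log (CLL) IRFs and the BMA-averaged IRF are:

\[
P_{\mathrm{NLL}}(\theta) = \exp\!\bigl(-\exp\!\bigl(-a_j(\theta - b_j)\bigr)\bigr),
\qquad
P_{\mathrm{CLL}}(\theta) = 1 - \exp\!\bigl(-\exp\!\bigl(a_j(\theta - b_j)\bigr)\bigr),
\]
\[
\bar{P}(\theta) = \sum_{m} p(M_m \mid \mathbf{y})\, P_m(\theta),
\]

where p(M_m | y) is the posterior probability of model m. The NLL approaches its lower asymptote quickly and its upper asymptote slowly, and the CLL does the reverse, which is what lets the averaged IRF capture asymmetry that a single symmetric (e.g., logistic) model cannot.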
{"title":"Bayesian model averaging of (a)symmetric item response models in small samples.","authors":"Fabio Setti, Leah Feuerstahler","doi":"10.1111/bmsp.70024","DOIUrl":"https://doi.org/10.1111/bmsp.70024","url":null,"abstract":"<p><p>Asymmetric IRT models present theoretically desirable features, but often require large sample sizes for stable estimation due to additional item parameters. When applying item response theory (IRT) to small samples, it is often the case that only models with relatively few item parameters can be reliably estimated. Two recently developed asymmetric IRT models, the negative log-log and the complementary log-log, allow for different IRF shapes compared to conventional IRT models and can be fit with small samples. In this paper, we propose Bayesian model averaging (BMA) of simple symmetric and asymmetric IRT models to explore item asymmetry and to flexibly estimate IRFs in small samples. We also consider model averaging at both the item level and the test level. We first show the feasibility of the approach with an empirical example. Then, in a simulation study involving complex data-generating conditions and small sample sizes (i.e., 100 and 250), we show that averaging methods recover asymmetry in the data-generating process and consistently outperform model selection and kernel smoothing. The methods proposed in this study are a practical alternative to more complex asymmetric IRT models and may also be a useful method in exploratory semi-parametric IRT analysis.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145770060","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Differential item functioning (DIF) can be investigated by estimating item response theory (IRT) parameters separately for different respondent groups, thus allowing for the detection of discrepancies in parameter estimates across groups. However, before comparing the estimates, it is necessary to convert them to a common metric because of the constraints required to identify the model. These processes influence each other, as the presence of DIF items affects the estimation of the scale conversion. This paper proposes a novel method that performs scale conversion and DIF detection simultaneously, so that the estimated scale conversion automatically takes the presence of DIF into account. Differences in the item parameter estimates across groups can be explained through variables at the within-group item level or by group membership itself. Penalized likelihood estimation is used to automatically select the item parameters that differ in some groups. Real-data applications and simulation studies show the good performance of the proposed method.
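A schematic of the penalized objective, in generic notation rather than the paper's exact estimator: writing β_j for the common parameters of item j and δ_jg for the group-specific deviation that remains after scale conversion, a lasso-type penalty shrinks non-DIF deviations exactly to zero:

\[
\bigl(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\delta}}\bigr)
= \arg\max \; \sum_{g} \ell_g\bigl(\boldsymbol{\beta}, \boldsymbol{\delta}_g\bigr)
\;-\; \lambda \sum_{g} \sum_{j} \bigl\lVert \boldsymbol{\delta}_{jg} \bigr\rVert_1,
\]

so items with nonzero estimated δ_jg are flagged as exhibiting DIF in group g, while the common β_j and the group transformations are effectively anchored by the remaining items.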
{"title":"Differential item functioning detection across multiple groups.","authors":"Michela Battauz","doi":"10.1111/bmsp.70023","DOIUrl":"https://doi.org/10.1111/bmsp.70023","url":null,"abstract":"<p><p>Differential item functioning (DIF) can be investigated by estimating item response theory (IRT) parameters separately for different respondent groups, thus allowing for the detection of discrepancies in parameter estimates across groups. However, before comparing the estimates, it is necessary to convert them to a common metric due to the constraints required to identify the model. These processes influence each other, as the presence of DIF items affects the estimation of scale conversion. This paper proposes a novel method that simultaneously performs scale conversion and DIF detection. By doing so, the estimated scale conversion automatically takes into account the presence of DIF. The differences of the item parameter estimates across groups can be explained through variables at the within-group item level or by the group itself. Penalized likelihood estimation is used to perform an automatic selection of the item parameters that differ in some groups. Real-data applications and simulation studies show the good performance of the proposal.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145764603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Q-matrix of a cognitively diagnostic assessment (CDA), documenting the item-attribute associations, is a key component of any CDA. However, the true Q-matrix underlying a CDA is never known and must be estimated, typically by content experts. Due to fallible human judgment, misspecifications of the Q-matrix may occur, resulting in the misclassification of examinees. In response to this challenge, algorithms have been developed to estimate the Q-matrix from item responses. Some algorithms impose identifiability conditions while others do not. The debate about which is "right" is ongoing, especially since these conditions are sufficient but not necessary, which means that viable alternative Q-matrix estimates may be ignored. In this study, the performance of Q-matrix estimation algorithms that impose identifiability conditions on the Q-matrix estimate was compared with that of estimation algorithms that do not. Large-scale simulations examined the impact of factors such as sample size, test length, number of attributes, and error levels. The estimated Q-matrices were evaluated for meeting the identifiability conditions and for their accuracy in classifying examinees. The simulation results showed that, for the various estimation algorithms studied here, imposing identifiability conditions on Q-matrix estimation did not change outcomes with respect to identifiability or examinee classification.
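As a toy illustration, consider J = 6 items measuring K = 2 attributes. One commonly cited set of sufficient conditions requires, among other things, that Q contain a K × K identity submatrix and that each attribute be measured by several items; the following Q-matrix satisfies this:

\[
Q = \begin{pmatrix}
1 & 0 \\
0 & 1 \\
1 & 0 \\
0 & 1 \\
1 & 1 \\
1 & 1
\end{pmatrix},
\]

where rows 1-2 form the identity submatrix I₂ and each attribute is measured by four items. Because such conditions are sufficient but not necessary, an estimation algorithm that enforces them can exclude alternative Q-matrices that fit the data equally well, which is precisely the trade-off examined in this study.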
{"title":"Identifiability conditions in cognitive diagnosis: Implications for Q-matrix estimation algorithms.","authors":"Hyunjoo Kim, Hans Friedrich Köhn, Chia-Yi Chiu","doi":"10.1111/bmsp.70020","DOIUrl":"https://doi.org/10.1111/bmsp.70020","url":null,"abstract":"<p><p>The Q-matrix of a cognitively diagnostic assessment (CDA), documenting the item-attribute associations, is a key component of any CDA. However, the true Q-matrix underlying a CDA is never known and must be estimated-typically by content experts. However, due to fallible human judgment, misspecifications of the Q-matrix may occur, resulting in the misclassification of examinees. In response to this challenge, algorithms have been developed to estimate the Q-matrix from item responses. Some algorithms impose identifiability conditions while others do not. The debate about which is \"right\" is ongoing; especially, since these conditions are sufficient but not necessary, which means viable alternative Q-matrix estimates may be ignored. In this study, the performance of Q-matrix estimation algorithms that impose identifiability conditions on the Q-matrix estimate was compared with that of estimation algorithms which do not impose such identifiability conditions. Large-scale simulations examined the impact of factors like sample size, test length, attributes, or error levels. The estimated Q-matrices were evaluated for meeting identifiability conditions and their accuracy in classifying examinees. The simulation results showed that for the various estimation algorithms studied here, imposing identifiability conditions on Q-matrix estimation did not change outcomes with respect to identifiability or examinee classification.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145745878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the present study, we extend a stochastic differential equation (SDE) model, the Ornstein-Uhlenbeck (OU) process, to the simultaneous analysis of time series of multiple variables by means of random effects for individuals and variables within a Bayesian framework. This SDE model is a stationary Gauss-Markov process that varies over time around its mean. Our extension allows us to estimate the variability of different parameters of the process, such as the mean (μ) or the drift parameter (φ), across individuals and variables of the system by means of marginalized posterior distributions. We illustrate the estimation and interpretability of the parameters of this multilevel OU process in an empirical study of affect dynamics in which multiple individuals were measured on different variables at multiple time points. We also conducted a simulation study to evaluate whether the model can recover the population parameters generating the OU process. Our results support the use of this model to obtain both the general parameters (common to all individuals and variables) and the individual- and variable-specific point estimates (random effects). We conclude that this multilevel OU process with individual- and variable-specific estimates as random effects can be a useful approach for analysing time series of multiple variables simultaneously.
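For reference, the OU process for individual i and variable v takes the standard SDE form (the random-effects indices are added here for illustration and are not necessarily the authors' notation):

\[
d\eta_{iv}(t) = \varphi_{iv}\bigl(\mu_{iv} - \eta_{iv}(t)\bigr)\,dt + \sigma_{iv}\, dW_{iv}(t),
\]

with stationary variance σ²_{iv}/(2φ_{iv}) and autocorrelation exp(−φ_{iv} Δt) over a lag Δt, so that μ_{iv} is the attractor level and φ_{iv} > 0 governs how quickly the process reverts to it after a perturbation.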
{"title":"A multilevel Ornstein-Uhlenbeck process with individual- and variable-specific estimates as random effects.","authors":"José Ángel Martínez-Huertas, Emilio Ferrer","doi":"10.1111/bmsp.70019","DOIUrl":"https://doi.org/10.1111/bmsp.70019","url":null,"abstract":"<p><p>In the present study, we extend a stochastic differential equation (SDE) model, the Ornstein-Uhlenbeck (OU) process, to the simultaneous analysis of time series of multiple variables by means of random effects for individuals and variables using a Bayesian framework. This SDE model is a stationary Gauss-Markov process that varies over time around its mean. Our extension allows us to estimate the variability of different parameters of the process, such as the mean (μ) or the drift parameter (φ), across individuals and variables of the system by means of marginalized posterior distributions. We illustrate the estimations and the interpretability of the parameters of this multilevel OU process in an empirical study of affect dynamics where multiple individuals were measured on different variables at multiple time points. We also conducted a simulation study to evaluate whether the model can recover the population parameters generating the OU process. Our results support the use of this model to obtain both the general parameters (common to all individuals and variables) and the variable-specific point estimates (random effects). We conclude that this multilevel OU process with individual- and variable-specific estimates as random effects can be a useful approach to analyse time series for multiple variables simultaneously.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145702571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability is crucial in psychometrics, reflecting the extent to which a measurement instrument can discriminate between individuals or items. While classical test theory and intraclass correlation coefficients are well established for quantitative scales, estimating reliability for binary outcomes presents unique challenges due to their discrete nature. This paper reviews and links three major approaches to estimating reliability for single ratings on binary scales: the normal approximation approach, kappa coefficients, and the latent variable approach, which enables estimation at both the latent and the manifest scale level. We clarify their conceptual relationships, establish conditions for asymptotic equivalence, and evaluate their performance across two common study designs: repeatability and reproducibility studies. We then extend the Bayesian Dirichlet-multinomial method for estimating kappa coefficients to settings with more than two replicates, without requiring Bayesian software. Additionally, we introduce a Bayesian method to estimate manifest-scale reliability from latent-scale reliability that can be implemented in standard Bayesian software. A simulation study compares the statistical properties of the three major approaches across Bayesian and frequentist frameworks. Overall, the normal approximation approach performed poorly, and the frequentist approach was unreliable due to singularity issues. The findings yield refined practical recommendations.
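Schematically, the latent variable approach posits a latent normal score dichotomized at a threshold; latent-scale reliability is then the tetrachoric correlation ρ between two replicates, and manifest-scale reliability (Cohen's kappa) follows from the implied 2 × 2 table. The notation here is a generic sketch, not the paper's:

\[
X_k = \mathbb{1}\{Z_k > \tau\}, \quad (Z_1, Z_2) \sim N_2(\mathbf{0};\, \rho),
\qquad
P(X_1 = 1, X_2 = 1) = \Phi_2(-\tau, -\tau;\, \rho),
\]
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]

where Φ₂ is the bivariate standard normal CDF, p_o the observed agreement, and p_e the chance agreement implied by the marginals; this mapping is what allows reliability to be reported on either the latent or the manifest scale.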
{"title":"From tetrachoric to kappa: How to assess reliability on binary scales.","authors":"Sophie Vanbelle","doi":"10.1111/bmsp.70021","DOIUrl":"https://doi.org/10.1111/bmsp.70021","url":null,"abstract":"<p><p>Reliability is crucial in psychometrics, reflecting the extent to which a measurement instrument can discriminate between individuals or items. While classical test theory and intraclass correlation coefficients are well-established for quantitative scales, estimating reliability for binary outcomes presents unique challenges due to their discrete nature. This paper reviews and links three major approaches to estimate reliability for single ratings on binary scales: the normal approximation approach, kappa coefficients, and the latent variable approach, which enables estimation at both latent and manifest scale levels. We clarify their conceptual relationships, show conditions for asymptotical equivalence, and evaluate their performance across two common study designs, repeatability and reproducibility studies. Then, we extend the Bayesian Dirichlet-multinomial method for estimating kappa coefficients to settings with more than two replicates, without requiring Bayesian software. Additionally, we introduce a Bayesian method to estimate manifest scale reliability from latent scale reliability that can be implemented in standard Bayesian software. A simulation study compares the statistical properties of the three major approaches across Bayesian and frequentist frameworks. Overall, the normal approximation approach performed poorly, and the frequentist approach was unreliable due to singularity issues. The findings offer further refined practical recommendations.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145702732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability evaluation is critical in fields such as psychology and medicine to ensure accurate diagnosis and effective treatment management. When participants are evaluated by the same raters, a two-way ANOVA model is suitable for the data, with the intraclass correlation coefficient (ICC) serving as the reliability metric. In these domains, the ICC for agreement (ICCa) is commonly used, as the values of the measurements themselves are of interest. Designing such reliability studies requires determining the number of participants and raters needed for estimating the ICCa. Although procedures for sample size determination exist based on the expected width of the confidence interval for the ICCa, there is limited work on hypothesis testing. This paper addresses this gap by proposing procedures to ensure sufficient power to statistically test whether the ICCa exceeds a predetermined value, utilizing confidence intervals for the ICCa. We compared the available confidence interval methods for the ICCa and propose sample size procedures based on the lower confidence limit of the best-performing methods. These procedures were evaluated in terms of the empirical power of the hypothesis test under various parameter configurations. Furthermore, they are implemented in an interactive R Shiny app that is freely available to researchers for determining sample sizes.
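For context, the two-way random-effects model and the agreement ICC it implies for a single rating, in standard notation (e.g., McGraw & Wong, 1996):

\[
Y_{ij} = \mu + p_i + r_j + e_{ij},
\qquad
\mathrm{ICCa} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_r^2 + \sigma_e^2},
\]

where p_i, r_j, and e_ij are independent participant, rater, and residual effects. Because σ²_r enters the denominator, systematic rater differences lower agreement, which is why the ICCa (rather than the consistency ICC, which omits σ²_r) is preferred when the measurement values themselves matter.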
{"title":"Sample size determination for hypothesis testing on the intraclass correlation coefficient in a two-way analysis of variance model.","authors":"Dipro Mondal, Alberto Cassese, Math J J M Candel, Sophie Vanbelle","doi":"10.1111/bmsp.70016","DOIUrl":"https://doi.org/10.1111/bmsp.70016","url":null,"abstract":"<p><p>Reliability evaluation is critical in fields such as psychology and medicine to ensure accurate diagnosis and effective treatment management. When participants are evaluated by the same raters, a two-way ANOVA model is suitable to model the data, with the intraclass correlation coefficient (ICC) serving as the reliability metric. In these domains, the ICC for agreement (ICCa) is commonly used, as the values of the measurements themselves are of interest. Designing such reliability studies requires determining the sample size of participants and raters for the ICCa. Although procedures for sample size determination exist based on the expected width of the confidence interval for the ICCa, there is limited work on hypothesis testing. This paper addresses this gap by proposing procedures to ensure sufficient power to statistically test whether the ICCa exceeds a predetermined value, utilizing confidence intervals for the ICCa. We compared the available confidence interval methods for the ICCa and proposed sample size procedures using the lower confidence limit of the best performing methods. These procedures were evaluated considering the empirical power of the hypothesis test under various parameter configurations. Furthermore, these procedures are implemented in an interactive R shiny app, freely available to researchers for determining sample sizes.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145524936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}