The item response theory (IRT) model is a widely used statistical method for exploring the relationship between individual latent traits and item responses. In this paper, a sparse IRT model is established to address sparsity in the factor loadings. A global-local shrinkage prior is imposed to penalize the factor loadings: the global parameter controls the amount of shrinkage at the column level, while the local parameters adjust the penalty on individual loadings within each column. We develop a variational Bayesian procedure for posterior inference. By exploiting a stochastic representation of the logistic function, we frame the sparse IRT model as a mixture model with a Pólya-Gamma mixing distribution. This strategy admits a conjugate posterior for the latent quantities, leading to straightforward posterior computation. We assess the performance of the proposed method in a simulation study. A real example from personality assessment is analysed to illustrate the usefulness of the methodology.
{"title":"Variational Bayesian inference for sparse item response theory models.","authors":"Yemao Xia, Yu Xue, Depeng Jiang","doi":"10.1111/bmsp.70032","DOIUrl":"https://doi.org/10.1111/bmsp.70032","url":null,"abstract":"<p><p>Item response theory (IRT) model is a widely appreciated statistical method in exploring the relationship between individual latent traits and item responses. In this paper, a sparse IRT model is established to address the sparsity of factor loadings. A global and local shrinkage prior is imposed to penalize the factor loadings: the global parameter controls the amount of shrinkage at the column levels, while the local parameter adjusts the penalty of factor loadings within each column. We develop a variational Bayesian procedure to conduct posterior inference. By exploiting a stochastic representation for logistic function, we frame sparse IRT model as a mixture model mixing with Pólya-Gamma distribution. Such a strategy admits a conjugate posterior for the latent quantity, thus leading to a straightforward posterior computation. We assess the performance of the proposed method via a simulation study. A real example related to personality assessment is analysed to illustrate the usefulness of methodology.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146121229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in computerized assessments have enabled the use of innovative item formats (e.g., drag-and-drop, scenario-based), necessitating a flexible model that can capture the systematic influence of item types on action counts. In this study, we present a refinement scheme that explicitly models common features of items and allows inference on item-type effects. We apply a multifaceted parameterization to characterize the common and unique features of items and implement the formulation in two existing models, the Rasch and Conway-Maxwell-Poisson count models. Inference procedures for the proposed models are presented using Stan and validated for estimation accuracy. Numerical experimentation with simulated data suggests that the proposed inferential scheme adequately recovers the underlying model parameters. An empirical application demonstrates that the proposed refinement holds practical relevance when data exhibit distinct item-type effects. Based on the findings from the empirical investigation, we discuss practical considerations in applying the Poisson models to the analysis of count data.
{"title":"Latent Poisson count models for action count data from technology-enhanced assessments.","authors":"Gregory Arbet, Hyeon-Ah Kang","doi":"10.1111/bmsp.70036","DOIUrl":"https://doi.org/10.1111/bmsp.70036","url":null,"abstract":"<p><p>Recent advances in computerized assessments have enabled the use of innovative item formats (e.g., drag-and-drop, scenario-based), necessitating a flexible model that can capture systematic influence of item types on action counts. In this study, we present a refinement scheme that can explicitly model common features of items and allows inference on the item-type effects. We apply multifaceted parameterization to characterize the common and unique features of items and implement the formulation in two existing models, the Rasch and Conway-Maxwell-Poisson count models. The inference procedures for the proposed models are presented using Stan and validated for estimation accuracy. Numerical experimentation with simulated data suggest that the proposed inferential scheme adequately recovers the underlying model parameters. Empirical application demonstrated that the proposed refinement holds practical relevance when data exhibit distinct item-type effects. Based on the findings from the empirical investigation, we discuss practical considerations in applying the Poisson models for analysing count data.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146114993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constructed-response (CR) items are widely used to assess higher-order skills but require human scoring, which introduces variability and is costly at scale. Machine learning (ML)-based scoring offers a scalable alternative, yet its psychometric consequences in rater-mediated models remain underexplored. This study examines how scoring design, rater bias, ML inconsistency and model specification affect the reliability of ability estimation in polytomous CR assessments. Using Monte Carlo simulation, we manipulated human and ML rater bias, ML inconsistency and scoring density (complete, overlapping, isolated). Five estimation models were compared, including the Partial Credit Model (PCM) with fixed thresholds and the Many-Facet Partial Credit Model (MFPCM) with and without free calibration. Results showed that systematic bias, not random inconsistency, was the main source of error. Hybrid human-ML scoring improved estimation when raters were unbiased or exhibited opposing biases, but error compounded when biases aligned. Across designs, PCM with fixed thresholds consistently outperformed more complex alternatives, while anchoring CR items to selected-response metrics stabilized MFPCM estimation. The real-data application replicated these patterns. Findings show that scoring design and bias structure, rather than model complexity, drive the benefits of hybrid scoring and that anchoring offers a practical strategy for stabilizing estimation.
{"title":"Revisiting reliability with human and machine learning raters under scoring design and rater configuration in the many-facet Rasch model.","authors":"Xingyao Xiao, Richard J Patz, Mark R Wilson","doi":"10.1111/bmsp.70034","DOIUrl":"https://doi.org/10.1111/bmsp.70034","url":null,"abstract":"<p><p>Constructed-response (CR) items are widely used to assess higher order skills but require human scoring, which introduces variability and is costly at scale. Machine learning (ML)-based scoring offers a scalable alternative, yet its psychometric consequences in rater-mediated models remain underexplored. This study examines how scoring design, rater bias, ML inconsistency and model specification affect the reliability of ability estimation in polytomous CR assessments. Using Monte Carlo simulation, we manipulated human and ML rater bias, ML inconsistency and scoring density (complete, overlapping, isolated). Five estimation models were compared, including the Partial Credit Model (PCM) with fixed thresholds and the Many-Facet Partial Credit Model (MFPCM) with and without free calibration. Results showed that systematic bias, not random inconsistency, was the main source of error. Hybrid human-ML scoring improved estimation when raters were unbiased or exhibited opposing biases, but error compounded when biases aligned. Across designs, PCM with fixed thresholds consistently outperformed more complex alternatives, while anchoring CR items to selected-response metrics stabilized MFPCM estimation. The real data application replicated these patterns. Findings show that scoring design and bias structure, rather than model complexity, drive the benefits of hybrid scoring and that anchoring offers a practical strategy for stabilizing estimation.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146094849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hidden Markov diagnostic classification models capture how students' cognitive attributes evolve over time. This paper introduces a Bayesian Markov chain Monte Carlo algorithm for diagnostic classification models that jointly estimates time-varying Q matrices, latent attributes, item parameters, attribute class proportions and transition matrices across multiple occasions. Using the R package hmdcm developed for this study, Monte Carlo simulations demonstrate accurate parameter recovery, and an empirical probability-concept assessment confirms the algorithm's ability to trace attribute trajectories, supporting its value for longitudinal diagnostic classification in both research and instructional practice.
{"title":"Bayesian inference for dynamic Q matrices and attribute trajectories in hidden Markov diagnostic classification models.","authors":"Chen-Wei Liu","doi":"10.1111/bmsp.70028","DOIUrl":"https://doi.org/10.1111/bmsp.70028","url":null,"abstract":"<p><p>Hidden Markov diagnostic classification models capture how students' cognitive attributes evolve over time. This paper introduces a Bayesian Markov chain Monte Carlo algorithm for diagnostic classification models that jointly estimates time-varying Q matrices, latent attributes, item parameters, attribute class proportions and transition matrices across multiple occasions. Using the R package hmdcm developed for this study, Monte Carlo simulations demonstrate accurate parameter recovery, and an empirical probability-concept assessment confirmed the algorithm's ability to trace attribute trajectories, supporting its value for longitudinal diagnostic classification in both research and instructional practice.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generalizability theory (G-theory) defines a statistical framework for assessing measurement reliability by decomposing observed variance into meaningful components attributable to persons, facets, and error. Classic G-theory assumes homoscedastic residual variances across measurement conditions, an assumption that is often violated in psychological and behavioural data. The main focus of this work is to extend G-theory using a mixed-effects location-scale model (MELSM) that allows residual error variance to vary systematically across conditions and persons. By modelling heteroscedasticity, we can extend the computation of condition-specific generalizability ($G_t$) and dependability ($D_t$) coefficients to reflect local reliability under varying degrees of measurement precision. As an illustration, we apply the model to empirical data from an EEG experiment and show that failing to account for variance heterogeneity can mask meaningful differences in measurement quality. A simulation-based decision study further demonstrates how targeted increases in measurement density can improve reliability for low-precision conditions or participants. The proposed framework retains the interpretative character of classical G-theory while enhancing its flexibility. We argue that it supports finer-grained insights into the conditions that influence reliability and better-informed design decisions in psychological measurement. We discuss implications for individualized reliability assessment, adaptive measurement strategies, and future extensions to multi-facet designs.
{"title":"Enhancing generalizability theory with mixed-effects models for heteroscedasticity in psychological measurement: A theoretical introduction with an application from EEG data.","authors":"Philippe Rast, Peter E Clayson","doi":"10.1111/bmsp.70026","DOIUrl":"https://doi.org/10.1111/bmsp.70026","url":null,"abstract":"<p><p>Generalizability theory (G-theory) defines a statistical framework for assessing measurement reliability by decomposing observed variance into meaningful components attributable to persons, facets, and error. Classic G-theory assumes homoscedastic residual variances across measurement conditions, an assumption that is often violated in psychological and behavioural data. The main focus of this work is to extend G-theory using a mixed-effects location-scale model (MELSM) that allows residual error variance to vary systematically across conditions and persons. By modeling heteroscedasticity, we can extend the computation of condition-specific generalizability ( <math> <semantics> <mrow> <msub><mrow><mi>G</mi></mrow> <mrow><mi>t</mi></mrow> </msub> </mrow> <annotation>$$ {G}_t $$</annotation></semantics> </math> ) and dependability ( <math> <semantics> <mrow> <msub><mrow><mi>D</mi></mrow> <mrow><mi>t</mi></mrow> </msub> </mrow> <annotation>$$ {D}_t $$</annotation></semantics> </math> ) coefficients to reflect local reliability under varying degrees of measurement precision. As an illustration, we apply the model to empirical data from an EEG experiment and show that failing to account for variance heterogeneity can mask meaningful differences in measurement quality. A simulation-based decision study further demonstrates how targeted increases in measurement density can improve reliability for low-precision conditions or participants. The proposed framework retains the interpretative character of classical G-theory while enhancing its flexibility. We argue that it supports finer-grained insights on conditions that influence reliability and better-informed design decisions in psychological measurements. We discuss implications for individualized reliability assessment, adaptive measurement strategies, and future extensions to multi-facet designs.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145999604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The widespread adoption of smartphones creates the possibility of passively monitoring everyday behaviour via sensors. Sensor data have been linked to individuals' moment-to-moment psychological symptoms and mood, and could thus alleviate the burden associated with repeated measurement of symptoms. Additionally, psychological care could be improved by predicting moments of high psychopathology and providing immediate interventions. Current research assumes that the relationship between sensor data and psychological symptoms is constant over time, or changes at a fixed rate: models are trained on all past data or on a fixed window, without comparing different window sizes with each other. This is problematic because choosing the wrong training window can harm prediction accuracy, especially if the underlying rate of change varies. As a potential solution, we compare different methodologies for choosing the window size, ranging from heuristics commonly used in practice to super-learning approaches. In a simulation study, we vary the rate at which the underlying relationship changes over time. We show that even computing a simple average across different windows can reduce the prediction error relative to selecting a single best window, for both simulated and real-world data.
{"title":"Comparing training window selection methods for prediction in non-stationary time series.","authors":"Fridtjof Petersen, Jonas M B Haslbeck, Jorge N Tendeiro, Anna M Langener, Martien J H Kas, Dimitris Rizopoulos, Laura F Bringmann","doi":"10.1111/bmsp.70018","DOIUrl":"10.1111/bmsp.70018","url":null,"abstract":"<p><p>The widespread adoption of smartphones creates the possibility to passively monitor everyday behaviour via sensors. Sensor data have been linked to moment-to-moment psychological symptoms and mood of individuals and thus could alleviate the burden associated with repeated measurement of symptoms. Additionally, psychological care could be improved by predicting moments of high psychopathology and providing immediate interventions. Current research assumes that the relationship between sensor data and psychological symptoms is constant over time - or changes with a fixed rate: Models are trained on all past data or on a fixed window, without comparing different window sizes with each other. This is problematic as choosing the wrong training window can negatively impact prediction accuracy, especially if the underlying rate of change is varying. As a potential solution we compare different methodologies for choosing the correct window size ranging from frequent practice based on heuristics to super learning approaches. In a simulation study, we vary the rate of change in the underlying relationship form over time. We show that even computing a simple average across different windows can help reduce the prediction error rather than selecting a single best window for both simulated and real world data.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145967898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Although individuals may exhibit both gradual and abrupt changes in their dynamic properties, as shaped by both slowly accumulating influences and acute events, existing statistical frameworks offer limited capacity for the simultaneous detection and representation of these distinct change patterns. We propose a Bayesian regime-switching (RS) modelling framework and an entropy measure adapted from the frequentist framework to facilitate simultaneous representation and testing of postulates of gradual and abrupt changes. Results from Monte Carlo simulation studies indicated that using a combination of entropy and information criterion measures, such as the Bayesian information criterion, was consistently most effective at facilitating the selection of the best-fitting model across varying magnitudes of abrupt changes. We found that slightly lower entropy thresholds may be helpful in facilitating the selection of longitudinal models with RS properties, as this class of models tended to yield lower entropy values than the conventional thresholds for reliable classification in cross-sectional mixture models, even under satisfactory parameter recovery and classification results. We fitted the proposed models and other candidate models to data collected from an intervention study on the psychological well-being (PWB) of college-attending early adults. Results suggested abrupt, regime-related transitions in the intra-individual variability levels of PWB dynamics among some participants following the intervention period. Practical usage of the entropy measure in conjunction with other model selection measures, and guidelines to enhance simultaneous detection of true abrupt and gradual changes, are discussed.
{"title":"Simultaneous detection of gradual and abrupt structural changes in Bayesian longitudinal modelling using entropy and model fit measures.","authors":"Yanling Li, Xiaoyue Xiong, Zita Oravecz, Sy-Miin Chow","doi":"10.1111/bmsp.70029","DOIUrl":"10.1111/bmsp.70029","url":null,"abstract":"<p><p>Although individuals may exhibit both gradual and abrupt changes in their dynamic properties as shaped by both slowly accumulating influences and acute events, existing statistical frameworks offer limited capacity for the simultaneous detection and representation of these distinct change patterns. We propose a Bayesian regime-switching (RS) modelling framework and an entropy measure adapted from the frequentist framework to facilitate simultaneous representation and testing of postulates of gradual and abrupt changes. Results from Monte Carlo simulation studies indicated that using a combination of entropy and information criterion measures such as the Bayesian information criterion was consistently most effective at facilitating the selection of the best-fitting model across varying magnitudes of abrupt changes. We found that slight lower entropy thresholds may be helpful in facilitating the selection of longitudinal models with RS properties as this class of models tended to yield lower entropy values than conventional thresholds for reliable classification in cross-sectional mixture models-even under satisfactory parameter recovery and classification results. We fitted the proposed models and other candidate models to the data collected from an intervention study on the psychological well-being (PWB) of college-attending early adults. Results suggested abrupt, regime-related transitions in the intra-individual variability levels of PWB dynamics among some participants following the intervention period. Practical usage of the entropy measure in conjunction with other model selection measures, and guidelines to enhance simultaneous detection of true abrupt and gradual changes are discussed.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2026-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12875568/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145919114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The growing popularity of the ecological momentary assessment method in psychological research requires adequate statistical models for intensive longitudinal data (ILD), with multilevel latent state-trait (ML-LST) models based on the revised latent state-trait theory (LST-R theory) as one possible option. Besides the traditional LST-R coefficients (reliability, consistency and occasion-specificity), ML-LST models are also suitable for estimating reliability at Level 1 ("within-subject reliability") and Level 2 ("between-subject reliability"). However, these level-specific coefficients have not yet been defined in LST-R theory and, therefore, their interpretation has been unclear from the perspective of LST-R theory. In the current study, we discuss the interpretation and identification of these coefficients based on the (multilevel) versions of the Multistate-Singletrait (MSST) model, the Multistate-Indicator-specific-trait (MSIT) model and the Multistate-Singletrait model with M-1 correlated method factors (MSST-M-1). We show that, in the MSST-M-1 model, the between-subject coefficient is a measure of the indicator-unspecificity of an item (i.e., the portion of between-level variance that a specific item shares with a common trait) or the unidimensionality of a scale. Moreover, we highlight differences between occasion-specificity and within-subject reliability. The performance of the ML-MSST-M-1 model and the corresponding theoretical findings are illustrated using data from an experience sampling study on within-person fluctuations of narcissistic admiration (Heyde et al., 2023).
{"title":"Level-specific reliability coefficients from the perspective of latent state-trait theory.","authors":"Lennart Nacke, Axel Mayer","doi":"10.1111/bmsp.70027","DOIUrl":"https://doi.org/10.1111/bmsp.70027","url":null,"abstract":"<p><p>The growing popularity of the ecological momentary assessment method in psychological research requires adequate statistical models for intensive longitudinal data (ILD), with multilevel latent state-trait (ML-LST) models based on the latent state-trait theory revised (LST-R theory) as one possible alternative. Besides the traditional LST-R coefficients reliability, consistency and occasion-specificity, ML-LST models are also suitable for estimating reliability at Level 1 (\"within-subject reliability\") and Level 2 (\"between-subject reliability\"). However, these level-specific coefficients have not yet been defined in LST-R theory and, therefore, their interpretation has been unclear from the perspective of LST-R theory. In the current study, we discuss the interpretation and identification of these coefficients based on the (multilevel) versions of the Multistate-Singletrait (MSST), the Multistate-Indicator-specific trait (MSIT) and the Multistate-Singletrait model with M-1 correlated method factors (MSST-M-1). We show that, in the MSST-M-1 model, the between-subject coefficient is a measure of the indicator-unspecificity of an item (i.e. the portion of between-level variance that a specific item shares with a common trait) or the unidimensionality of a scale. Moreover, we highlight differences between occasion-specificity and within-subject reliability. The performance of the ML-MSST-M-1 model and the corresponding theoretical findings are illustrated using data from an experience sampling study on the within-person fluctuations of narcissistic admiration (Heyde et al., 2023).</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145844542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Latent variable models typically require large sample sizes for acceptable efficiency and reliable convergence. Appropriate informative priors are often required to employ Bayesian analysis gainfully with small samples. Power priors are informative priors built on historical data, weighted to account for non-exchangeability with the current sample. Many extant power prior approaches are designed for manifest variable models and are not easily adapted to latent variable models; for example, they may require integration over all model parameters. We examined two recent power prior approaches that are straightforward to adapt to these models: Mahalanobis weight (MW) priors, based on Golchi (Use of historical individual patient data in analysis of clinical trials, 2020), and univariate priors, based on Finch (The Psychiatrist, 6, 2024, 45)'s application of Haddad et al. (Journal of Biopharmaceutical Statistics, 27, 2017, 1089) and Balcome et al. (bayesdp: Implementation of the Bayesian discount prior approach for clinical trials, 2022). We applied these approaches, along with diffuse and weakly informative priors, to a latent variable mediation model under various sample sizes and non-exchangeability conditions. We compared their performance in terms of convergence, bias, efficiency, and credible interval coverage when estimating an indirect effect. Diffuse priors and the univariate approach led to poor convergence. The weakly informative and MW approaches both improved convergence and yielded reasonable estimates, but MW performed poorly under some non-exchangeable conditions. We discuss the issues with these approaches and future research directions.
{"title":"Power priors for latent variable mediation models under small sample sizes.","authors":"Lihan Chen, Milica Miočević, Carl F Falk","doi":"10.1111/bmsp.70025","DOIUrl":"https://doi.org/10.1111/bmsp.70025","url":null,"abstract":"<p><p>Latent variable models typically require large sample sizes for acceptable efficiency and reliable convergence. Appropriate informative priors are often required for gainfully employing Bayesian analysis with small samples. Power priors are informative priors built on historical data, weighted to account for non-exchangeability with the current sample. Many extant power prior approaches are designed for manifest variable models, and are not easily adapted for latent variable models, for example, they may require integration over all model parameters. We examined two recent power prior approaches straightforward to adapt to these models, Mahalanobis weight (MW) priors based on Golchi (Use of historical individual patient data in analysis of clinical trials, 2020), and univariate priors, based on Finch (The Psychiatrist, 6, 2024, 45)'s application of Haddad et al. (Journal of Biopharmaceutical Statistics, 27, 2017, 1089) and Balcome et al. (bayesdp: Implementation of the Bayesian discount prior approach for clinical trials, 2022). We applied these approaches along with diffuse and weakly informative priors to a latent variable mediation model, under various sample sizes and non-exchangeability conditions. We compared their performances in terms of convergence, bias, efficiency, and credible interval coverage when estimating an indirect effect. Diffuse priors and the univariate approach lead to poor convergence. The weakly informative and MW approach both improved convergence and yielded reasonable estimates, but MW performed poorly under some non-exchangeable conditions. We discussed the issues with these approaches and future research directions.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145822099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interrater reliability plays a crucial role in various areas of psychology. In this article, we propose a multilevel latent time series model for intensive longitudinal data with structurally different raters (e.g., self-reports and partner reports). The new MR-MLTS model enables researchers to estimate idiographic (person-specific) rater consistency coefficients for contemporaneous or dynamic rater agreement. Additionally, the model allows rater consistency coefficients to be linked to external explanatory or outcome variables. It can be implemented in Mplus as well as in the newly developed R package mlts. We illustrate the model using data from an intensive longitudinal multirater study involving 100 heterosexual couples (200 individuals) assessed across 86 time points. Our findings show that relationship duration and partner cognitive resources positively predict rater consistency for the innovations. Results from a simulation study indicate that the number of time points is critical for accurately estimating idiographic rater consistency coefficients, whereas the number of participants is important for accurately recovering the random effect variances. We discuss advantages, limitations, and future extensions of the MR-MLTS model.
{"title":"Idiographic interrater reliability measures for intensive longitudinal multirater data.","authors":"Tobias Koch, Miriam F Jaehne, Michaela Riediger, Antje Rauers, Jana Holtmann","doi":"10.1111/bmsp.70022","DOIUrl":"https://doi.org/10.1111/bmsp.70022","url":null,"abstract":"<p><p>Interrater reliability plays a crucial role in various areas of psychology. In this article, we propose a multilevel latent time series model for intensive longitudinal data with structurally different raters (e.g., self-reports and partner reports). The new MR-MLTS model enables researchers to estimate idiographic (person-specific) rater consistency coefficients for contemporaneous or dynamic rater agreement. Additionally, the model allows rater consistency coefficients to be linked to external explanatory or outcome variables. It can be implemented in Mplus as well as in the newly developed R package mlts. We illustrate the model using data from an intensive longitudinal multirater study involving 100 heterosexual couples (200 individuals) assessed across 86 time points. Our findings show that relationship duration and partner cognitive resources positively predict rater consistency for the innovations. Results from a simulation study indicate that the number of time points is critical for accurately estimating idiographic rater consistency coefficients, whereas the number of participants is important for accurately recovering the random effect variances. We discuss advantages, limitations, and future extensions of the MR-MLTS model.</p>","PeriodicalId":55322,"journal":{"name":"British Journal of Mathematical & Statistical Psychology","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145795447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}