Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion

Journal of Classification (IF 1.9; Zone 4, Computer Science; Q2, MATHEMATICS, INTERDISCIPLINARY APPLICATIONS) Pub Date: 2024-06-19 DOI: 10.1007/s00357-024-09477-z

Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian

Citations: 0

Abstract

This paper considers clustering longitudinal data for growth curve modelling, with the aim of optimally estimating the underlying unknown group partition matrix. The conventional soft clustering approach assumes that the columns of the partition matrix have i.i.d. multinomial or categorical prior distributions, and it estimates the posterior distribution of the partition matrix using a regression model whose response follows a finite mixture distribution. Instead, we propose an iterative partition and regression procedure that finds the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC). Because of the combinatorial explosion involved, this minimizer is difficult to compute by enumerating all candidate partition matrices. We therefore develop a Gibbs sampling method that generates a Markov chain of candidate partition matrices whose equilibrium probability distribution equals the one induced by eBIC. We further show that, given the number of latent clusters a priori, the best partition matrix can be consistently estimated from this Markov chain in a computationally scalable way. The number of latent clusters is likewise estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed in a comprehensive simulation study and then applied to two real-world growth curve modelling examples involving longitudinal data clustering.
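The idea of sampling partitions from a criterion-induced distribution can be illustrated with a minimal sketch. It is not the authors' algorithm: the eBIC formula is not given in this abstract, so the code substitutes a plain BIC-style criterion, assumes straight-line growth curves, and assumes the target distribution is proportional to exp(-criterion/2). All function names (`line_rss`, `criterion`, `gibbs_partition`) and the simulated data are hypothetical.

```python
import math
import random

def line_rss(t, curves):
    """RSS of one straight-line growth curve fitted by least squares to all curves."""
    if not curves:
        return 0.0                       # an empty cluster contributes nothing
    pts = [(ti, yi) for curve in curves for ti, yi in zip(t, curve)]
    m = len(pts)
    tbar = sum(ti for ti, _ in pts) / m
    ybar = sum(yi for _, yi in pts) / m
    sxx = sum((ti - tbar) ** 2 for ti, _ in pts)
    sxy = sum((ti - tbar) * (yi - ybar) for ti, yi in pts)
    slope = sxy / sxx
    return sum((yi - ybar - slope * (ti - tbar)) ** 2 for ti, yi in pts)

def criterion(labels, t, Y, K):
    """BIC-style stand-in (NOT the paper's eBIC): N*log(RSS/N) + 2K*log(N)."""
    N = len(Y) * len(t)
    rss = sum(line_rss(t, [y for y, z in zip(Y, labels) if z == k]) for k in range(K))
    return N * math.log(rss / N) + 2 * K * math.log(N)

def gibbs_partition(t, Y, K, sweeps=30, seed=0):
    """Gibbs sweeps over subject labels; assumed target is proportional to exp(-criterion/2)."""
    rng = random.Random(seed)
    labels = [i % K for i in range(len(Y))]
    rng.shuffle(labels)                  # balanced random start
    best, best_ic = labels[:], criterion(labels, t, Y, K)
    for _ in range(sweeps):
        for i in range(len(Y)):
            ic = []
            for k in range(K):           # score every candidate label for subject i
                labels[i] = k
                ic.append(criterion(labels, t, Y, K))
            low = min(ic)
            weights = [math.exp(-0.5 * (c - low)) for c in ic]
            labels[i] = rng.choices(range(K), weights=weights)[0]
            if ic[labels[i]] < best_ic:  # remember the lowest-criterion partition visited
                best, best_ic = labels[:], ic[labels[i]]
    return best, best_ic

# Simulated example: two linear growth curves, 15 subjects each, 8 time points.
rng = random.Random(1)
t = [j / 7 for j in range(8)]
truth = [0] * 15 + [1] * 15
Y = [[(1 + 2 * tj if g == 0 else 6 - tj) + rng.gauss(0, 0.3) for tj in t] for g in truth]
best, best_ic = gibbs_partition(t, Y, K=2)
```

Each label update draws from the conditional distribution of one subject's cluster given all other labels, so the chain never enumerates the full set of partitions. Under the same assumptions, the number of clusters could be chosen by repeating the run for several values of `K` and keeping the one with the smallest `best_ic`.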

Source journal

Journal of Classification (Mathematics: Mathematics, Interdisciplinary Applications)

CiteScore: 3.60

Self-citation rate: 5.00%

Review time: >12 weeks

About the journal: To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.