Clustering Longitudinal Data for Growth Curve Modelling by Gibbs Sampler and Information Criterion

IF 1.8 | Zone 4 (Computer Science) | Q2 Mathematics, Interdisciplinary Applications | Journal of Classification | Pub Date: 2024-06-19 | DOI: 10.1007/s00357-024-09477-z
Yu Fei, Rongli Li, Zhouhong Li, Guoqi Qian

Abstract

This paper considers clustering longitudinal data for growth curve modelling, with the aim of optimally estimating the underlying unknown group partition matrix. The conventional soft clustering approach assumes that the columns of the partition matrix have i.i.d. multinomial or categorical prior distributions, and uses a regression model whose response follows a finite mixture distribution to estimate the posterior distribution of the partition matrix. Instead, we propose an iterative partition and regression procedure to find the best partition matrix and the associated best growth curve regression model for each identified cluster. We show that the best partition matrix is the one minimizing a recently developed empirical Bayes information criterion (eBIC); because of the combinatorial explosion involved, this minimizer is difficult to compute by enumerating all candidate partition matrices. We therefore develop a Gibbs sampling method that generates a Markov chain of candidate partition matrices whose equilibrium probability distribution equals the one induced by eBIC. We further show that, given the number of latent clusters a priori, the best partition matrix can be consistently estimated from this Markov chain in a computationally scalable way. The number of latent clusters is likewise estimated by minimizing eBIC. The proposed iterative clustering and regression method is assessed in a comprehensive simulation study and then applied to two real-world growth curve modelling examples involving longitudinal data clustering.
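
The idea of sampling partitions from a distribution induced by an information criterion can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the eBIC formula, so a plain BIC-style score stands in for it here. The polynomial growth curve, the per-cluster variance, and the names `cluster_bic`, `total_score` and `gibbs_partition` are all assumptions made for illustration.

```python
import numpy as np

def cluster_bic(t, Y, degree):
    """BIC-style score for one cluster whose curves share one polynomial in t.

    NOTE: a plain BIC stand-in, not the eBIC of the paper.
    """
    n, m = Y.shape                       # n curves, m time points each
    if n == 0:
        return 0.0                       # an empty cluster contributes nothing
    X = np.tile(np.vander(t, degree + 1), (n, 1))   # stacked design matrix
    y = Y.ravel()                        # curve-by-curve, matching the stacking
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    N = n * m
    sigma2 = max(rss / N, 1e-12)         # guard against log(0)
    return N * np.log(sigma2) + (degree + 2) * np.log(N)

def total_score(t, Y, labels, K, degree):
    """Sum of per-cluster scores for a given label vector."""
    return sum(cluster_bic(t, Y[labels == k], degree) for k in range(K))

def gibbs_partition(t, Y, K, degree=1, sweeps=30, seed=0):
    """Gibbs sampler over labels with target proportional to exp(-score / 2).

    Returns the lowest-scoring label vector visited and its score.
    """
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    labels = rng.integers(K, size=n)
    best_labels = labels.copy()
    best_score = total_score(t, Y, labels, K, degree)
    for _ in range(sweeps):
        for i in range(n):
            # Score every candidate label for curve i, all others held fixed.
            scores = np.empty(K)
            for k in range(K):
                labels[i] = k
                scores[k] = total_score(t, Y, labels, K, degree)
            # Full conditional; subtracting the minimum avoids overflow.
            w = np.exp(-0.5 * (scores - scores.min()))
            labels[i] = rng.choice(K, p=w / w.sum())
            if scores[labels[i]] < best_score:
                best_score = float(scores[labels[i]])
                best_labels = labels.copy()
    return best_labels, best_score
```

Under these assumptions, calling `gibbs_partition(t, Y, K)` with `Y` as an array of shape (curves, time points) returns the lowest-scoring partition visited by the chain. Rerunning it over several values of `K` and keeping the smallest score would mimic, very loosely, the abstract's remark that the number of clusters is also chosen by minimizing the criterion.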

Source journal
Journal of Classification (Mathematics, Interdisciplinary Applications)
CiteScore: 3.60
Self-citation rate: 5.00%
Articles published: 16
Review time: >12 weeks
Journal description: To publish original and valuable papers in the field of classification, numerical taxonomy, multidimensional scaling and other ordination techniques, clustering, tree structures and other network models (with somewhat less emphasis on principal components analysis, factor analysis, and discriminant analysis), as well as associated models and algorithms for fitting them. Articles will support advances in methodology while demonstrating compelling substantive applications. Comprehensive review articles are also acceptable. Contributions will represent disciplines such as statistics, psychology, biology, information retrieval, anthropology, archeology, astronomy, business, chemistry, computer science, economics, engineering, geography, geology, linguistics, marketing, mathematics, medicine, political science, psychiatry, sociology, and soil science.