Optimal variable clustering for high-dimensional matrix valued data.

Impact factor: 1.6 | Zone 4, Mathematics | JCR Q2, MATHEMATICS, APPLIED | Information and Inference-A Journal of the Ima | Pub Date: 2025-03-12 | eCollection Date: 2025-03-01 | DOI: 10.1093/imaiai/iaaf001

Inbeom Lee, Siyi Deng, Yang Ning

Information and Inference-A Journal of the Ima, vol. 14, no. 1, article iaaf001. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899537/pdf/

Citations: 0

Abstract

Matrix-valued data have become increasingly prevalent in many applications. Most existing clustering methods for this type of data are tailored to the mean model and do not account for the dependence structure of the features, which can be very informative, especially in high-dimensional settings or when mean information is not available. To extract information from the dependence structure for clustering, we propose a new latent variable model for features arranged in matrix form, with unknown membership matrices representing the clusters of the rows and columns. Under this model, we further propose a class of hierarchical clustering algorithms that use the difference of a weighted covariance matrix as the dissimilarity measure. Theoretically, we show that under mild conditions, our algorithm attains clustering consistency in the high-dimensional setting. While this consistency result holds for a broad class of weighted covariance matrices, the conditions it requires depend on the choice of the weight. To investigate how the weight affects the theoretical performance of our algorithm, we establish a minimax lower bound for clustering under our latent variable model in terms of a cluster separation metric. Given these results, we identify the optimal weight, in the sense that using this weight guarantees that our algorithm is minimax rate-optimal. The practical implementation of our algorithm with the optimal weight is also discussed. Simulation studies show that our algorithm performs better than existing methods in terms of the adjusted Rand index (ARI). The method is applied to a genomic dataset and yields meaningful interpretations.
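The idea of clustering rows by differences in a weighted covariance matrix can be illustrated with a rough sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the data model X = A Z Bᵀ + E, the identity weight W, the max-absolute-difference dissimilarity, and the average linkage. The function names are hypothetical, and the paper's optimal weight is not implemented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def weighted_row_covariance(X, W):
    """Sample estimate of E[X W X^T] for X of shape (n, p, q) and W of shape (q, q)."""
    Xc = X - X.mean(axis=0)  # center each matrix entry across the n samples
    return np.einsum("nij,jk,nlk->il", Xc, W, Xc) / X.shape[0]


def covariance_difference(S):
    """Assumed dissimilarity: d(a, b) = max over c not in {a, b} of |S[a, c] - S[b, c]|."""
    p = S.shape[0]
    D = np.zeros((p, p))
    for a in range(p):
        for b in range(a + 1, p):
            mask = np.ones(p, dtype=bool)
            mask[[a, b]] = False  # exclude the pair itself from the comparison
            D[a, b] = D[b, a] = np.max(np.abs(S[a, mask] - S[b, mask]))
    return D


def cluster_rows(X, W, n_clusters):
    """Hierarchical clustering of the p rows from the covariance-difference dissimilarity."""
    D = covariance_difference(weighted_row_covariance(X, W))
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Synthetic data under the assumed model X_i = A Z_i B^T + noise.
rng = np.random.default_rng(0)
n, p, q = 500, 6, 4
A = np.eye(2)[[0, 0, 0, 1, 1, 1]]  # row membership matrix (p x 2), hypothetical
B = np.eye(2)[[0, 0, 1, 1]]        # column membership matrix (q x 2), hypothetical
Z_latent = rng.normal(size=(n, 2, 2))
X = A @ Z_latent @ B.T + 0.5 * rng.normal(size=(n, p, q))

labels = cluster_rows(X, np.eye(q), n_clusters=2)
print(labels)
```

Columns could be clustered the same way by passing the transposed matrices, `X.transpose(0, 2, 1)`, together with a p x p weight. Agreement with known labels would then typically be scored with the adjusted Rand index.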




Source journal

Information and Inference-A Journal of the Ima Multiple-

CiteScore

3.90

Self-citation rate

0.00%

Number of publications