Bayesian estimation of the number of significant principal components for cultural data

arXiv - STAT - Applications, Pub Date: 2024-09-18, DOI: arxiv-2409.12129

Joshua C. Macdonald, Javier Blanco-Portillo, Marcus W. Feldman, Yoav Ram



Abstract

Principal component analysis (PCA) is often used to analyze multivariate data together with cluster analysis, which depends on the number of principal components used. It is therefore important to determine the number of significant principal components (PCs) extracted from a data set. Here we use a variational Bayesian version of classical PCA to develop a new method for estimating the number of significant PCs in contexts where the number of samples is similar to or greater than the number of features. This eliminates guesswork and potential bias in manually determining the number of principal components and avoids overestimation of variance by filtering noise. This framework can be applied to datasets of different shapes (number of rows and columns), different data types (binary, ordinal, categorical, continuous), and with noisy and missing data. Therefore, it is especially useful for data with arbitrary encodings and similar numbers of rows and columns, such as cultural, ecological, morphological, and behavioral datasets. We tested our method on both synthetic data and empirical datasets and found that it may underestimate but not overestimate the number of principal components for the synthetic data. A small number of components was found for each empirical dataset. These results suggest that the method is broadly applicable across the life sciences.
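To make the task concrete, the sketch below estimates the number of components on a toy data set. It is not the variational Bayesian method described in the abstract. It swaps in a simpler, well-known stand-in: the closed-form maximum-likelihood fit of probabilistic PCA, with the number of components chosen by the Bayesian information criterion (BIC). The function names, the toy data generator, and all parameter values are assumptions made for illustration, and the sketch assumes continuous data with more samples than features and no missing values.

```python
import numpy as np


def ppca_log_likelihood(eigvals, k, n_samples):
    """Maximised probabilistic-PCA log-likelihood when k components are kept.

    eigvals: eigenvalues of the sample covariance, sorted in descending order.
    """
    d = eigvals.size
    # ML noise variance is the mean of the discarded eigenvalues.
    sigma2 = eigvals[k:].mean()
    return -0.5 * n_samples * (
        d * np.log(2.0 * np.pi)
        + np.log(eigvals[:k]).sum()
        + (d - k) * np.log(sigma2)
        + d
    )


def select_n_components_bic(X):
    """Return the number of components (0..d-1) that minimises BIC."""
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending order
    eigvals = np.clip(eigvals, 1e-12, None)   # guard against log(0)

    best_k, best_bic = 0, np.inf
    for k in range(d):  # keep at least one direction for the noise estimate
        # Free parameters: loadings up to rotation, noise variance, mean.
        n_params = d * k - k * (k - 1) / 2 + 1 + d
        bic = -2.0 * ppca_log_likelihood(eigvals, k, n) + n_params * np.log(n)
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


def make_toy_data(n_samples=500, n_features=10, scales=(4.0, 3.0, 2.0),
                  noise_sd=0.5, seed=0):
    """Hypothetical generator: low-rank signal plus isotropic Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Orthonormal signal directions, one per entry in `scales`.
    Q, _ = np.linalg.qr(rng.normal(size=(n_features, len(scales))))
    W = (Q * np.asarray(scales)).T            # shape (n_components, n_features)
    Z = rng.normal(size=(n_samples, len(scales)))
    noise = noise_sd * rng.normal(size=(n_samples, n_features))
    return Z @ W + noise


X = make_toy_data()
k_hat = select_n_components_bic(X)
print("estimated number of components:", k_hat)
```

The toy data has three planted components whose variances sit well above the noise level, so a criterion of this kind should have an easy time. Real cultural or ecological data with mixed encodings and missing entries would need the fuller treatment the abstract describes.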


