Bayesian estimation of the number of significant principal components for cultural data

Joshua C. Macdonald, Javier Blanco-Portillo, Marcus W. Feldman, Yoav Ram
arXiv - STAT - Applications, published 2024-09-18
DOI: arxiv-2409.12129 (https://doi.org/arxiv-2409.12129)

Abstract

Principal component analysis (PCA) is often used to analyze multivariate data together with cluster analysis, which depends on the number of principal components used. It is therefore important to determine the number of significant principal components (PCs) extracted from a data set. Here we use a variational Bayesian version of classical PCA to develop a new method for estimating the number of significant PCs in contexts where the number of samples is similar to or greater than the number of features. This eliminates guesswork and potential bias in manually determining the number of principal components and, by filtering noise, avoids overestimation of variance. The framework can be applied to datasets of different shapes (numbers of rows and columns), with different data types (binary, ordinal, categorical, continuous), and with noisy and missing data. It is therefore especially useful for data with arbitrary encodings and similar numbers of rows and columns, such as cultural, ecological, morphological, and behavioral datasets. We tested our method on both synthetic and empirical datasets and found that it may underestimate, but does not overestimate, the number of principal components for the synthetic data. A small number of components was found for each empirical dataset. These results suggest that the method is broadly applicable across the life sciences.
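To make the idea of letting a Bayesian prior decide how many PCs are significant more concrete, here is a minimal sketch in Python. It is not the authors' method: it swaps in a simpler relative of variational Bayesian PCA, namely an EM-style (MAP) Bayesian PCA with an automatic relevance determination (ARD) prior that shrinks unneeded loading columns toward zero. The function name `bayesian_pca_ard`, the parameters `n_iter` and `rel_threshold`, the eigendecomposition-based initialization, the rule for counting retained columns, and the synthetic data settings are all assumptions made for illustration. The sketch handles only complete, continuous data.

```python
import numpy as np


def bayesian_pca_ard(X, n_iter=300, rel_threshold=1e-2):
    """EM-style Bayesian PCA with an ARD prior on the loading columns.

    Hypothetical helper, not the paper's algorithm. Returns (k, W, sigma2),
    where k counts the columns of W whose squared norm exceeds
    rel_threshold times the largest squared column norm.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                    # sample covariance, d x d
    q = d - 1                            # largest latent dimension considered

    # Assumption: start from the leading eigenvectors of S so that the
    # run is fast and reproducible.
    evals, evecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q], 1e-12))
    sigma2 = max(float(evals[q:].mean()), 1e-12)

    for _ in range(n_iter):
        # ARD precision per column: large when the column is small.
        col_norm2 = np.sum(W**2, axis=0)
        alpha = d / np.maximum(col_norm2, 1e-12)

        # E-step, written as per-sample averages of the latent moments.
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        SW = S @ W
        Stx = SW @ Minv                               # mean of t <x>^T
        Sxx = sigma2 * Minv + Minv @ W.T @ SW @ Minv  # mean of <x x^T>

        # M-step: the ARD term penalizes columns with large alpha.
        B = Sxx + (sigma2 / n) * np.diag(alpha)
        W = np.linalg.solve(B, Stx.T).T
        resid = np.trace(S) - 2.0 * np.trace(W.T @ Stx) + np.trace(Sxx @ W.T @ W)
        sigma2 = max(float(resid) / d, 1e-12)

    col_norm2 = np.sum(W**2, axis=0)
    k = int(np.sum(col_norm2 > rel_threshold * col_norm2.max()))
    return k, W, sigma2


# Synthetic check: 500 samples, 10 features, 3 true components plus noise.
rng = np.random.default_rng(0)
n_samples, n_features, true_rank = 500, 10, 3
directions = np.linalg.qr(rng.normal(size=(n_features, true_rank)))[0]
scores = rng.normal(size=(n_samples, true_rank)) * np.array([3.0, 2.0, 1.5])
X = scores @ directions.T + 0.1 * rng.normal(size=(n_samples, n_features))

k, W, sigma2 = bayesian_pca_ard(X)
print("retained components:", k, "estimated noise variance:", sigma2)
```

In this toy setting the signal is much stronger than the noise, so the ARD prior should suppress the surplus columns and leave the three planted components. With weaker signal or fewer samples, a prior of this kind tends to prune real components rather than keep spurious ones, which is in line with the abstract's observation that the method may underestimate but not overestimate the number of PCs.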