Nonparametric Bayesian functional clustering with applications to racial disparities in breast cancer

IF 2.1 | Zone 4 (Mathematics) | Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Statistical Analysis and Data Mining | Pub Date: 2024-01-25 | DOI: 10.1002/sam.11657

Wenyu Gao, Inyoung Kim, Wonil Nam, Xiang Ren, Wei Zhou, Masoud Agah


Citation count: 0

Abstract

As massive data sets become easier to access, functional analyses have attracted growing interest. Such data sets, however, are often highly heterogeneous, noisy, and high-dimensional, and classical methods may not carry over directly when analyses are generalized from vectors to functions. This paper considers the reduction of noisy information in functional analyses from two perspectives: functional clustering, which groups similar observations and thus reduces the sample size, and functional variable selection, which reduces the dimensionality. Because a Bayesian hierarchical model is flexible, it can readily model complicated data structures and relations. Hence, this paper proposes a nonparametric Bayesian method for functional clustering and peak point selection. The method uses weighted Dirichlet process mixture (WDPM) modeling, which clusters automatically and provides accurate estimates, together with a conditional Laplace prior, which is a conjugate variable selection prior. The proposed method, named WDPM-VS for short, simultaneously performs the following tasks: (1) clusters automatically without requiring the number of clusters or the cluster centers to be specified beforehand; (2) clusters heterogeneously behaved functions; (3) selects vibrational peak points; and (4) reduces noisy information from the two perspectives of sample size and dimensionality. The method greatly outperforms its comparison methods in root mean squared error. Based on the proposed method, we are able to identify biological factors that can explain racial disparities in breast cancer.
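The abstract does not spell out the WDPM construction, so the following is not the authors' method. It is a minimal sketch, using the standard (unweighted) Chinese restaurant process representation of a Dirichlet process, of why this family of models needs no preset number of clusters: each new observation joins an existing cluster with probability proportional to that cluster's size, or opens a new one with probability proportional to a concentration parameter. The function name `crp_assignments` and the parameter `alpha` are illustrative assumptions, not names from the paper.

```python
import random


def crp_assignments(n, alpha, seed=0):
    """Draw cluster labels for n items from a Chinese restaurant process.

    Illustrative only: this is the plain Dirichlet process prior on
    partitions, not the weighted variant (WDPM) described in the abstract.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for i in range(n):
        # Item i joins existing cluster k with probability counts[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # a new cluster is created
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n=50, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is an outcome of the draw and is never passed in. Larger values of `alpha` tend to produce more clusters. In a full mixture model these prior weights would be combined with a likelihood for each function, which this sketch omits.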




Source journal

Statistical Analysis and Data Mining (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE; COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)

CiteScore

3.20

Self-citation rate

7.70%


About the journal: Statistical Analysis and Data Mining addresses the broad area of data analysis, including statistical approaches, machine learning, data mining, and applications. Topics include statistical and computational approaches for analyzing massive and complex datasets, novel statistical and/or machine learning methods and theory, and state-of-the-art applications with high impact. Of special interest are articles that describe innovative analytical techniques and discuss their application to real problems in a way that is accessible and beneficial to domain experts across science, engineering, and commerce. The journal focuses on papers that satisfy one or more of the following criteria: (1) solve data analysis problems associated with massive, complex datasets; (2) develop innovative statistical approaches, machine learning algorithms, or methods integrating ideas across disciplines, e.g., statistics, computer science, electrical engineering, and operations research; (3) formulate and solve high-impact real-world problems that challenge existing paradigms via new statistical and/or computational models; and (4) provide surveys of prominent research topics.