A goodness-of-fit test for geometric Brownian motion
Pub Date: 2025-04-23 | DOI: 10.1016/j.csda.2025.108196
Daniel Gaigall, Philipp Wübbolding
A new goodness-of-fit test for the composite null hypothesis that data originate from a geometric Brownian motion is studied in the functional data setting. This is equivalent to testing whether the data come from a scaled Brownian motion with linear drift. Critical values for the test are obtained, ensuring that the specified significance level is achieved in finite samples. The asymptotic behavior of the test statistic under the null distribution and under alternatives is studied, and the test is shown to be consistent. Furthermore, the proposed approach is fast and simple to implement. A comprehensive simulation study shows that the power of the new test compares favorably with that of existing methods. A key application is assessing whether financial time series are adequately described by the Black-Scholes model. Examples involving various stock and interest rate time series illustrate the proposed test.
{"title":"A goodness-of-fit test for geometric Brownian motion","authors":"Daniel Gaigall , Philipp Wübbolding","doi":"10.1016/j.csda.2025.108196","DOIUrl":"10.1016/j.csda.2025.108196","url":null,"abstract":"<div><div>A new goodness-of-fit test for the composite null hypothesis that data originate from a geometric Brownian motion is studied in the functional data setting. This is equivalent to testing if the data are from a scaled Brownian motion with linear drift. Critical values for the test are obtained, ensuring that the specified significance level is achieved in finite samples. The asymptotic behavior of the test statistic under the null distribution and alternatives is studied, and it is also demonstrated that the test is consistent. Furthermore, the proposed approach offers advantages in terms of fast and simple implementation. A comprehensive simulation study shows that the power of the new test compares favorably to that of existing methods. A key application is the assessment of financial time series for the suitability of the Black-Scholes model. Examples relating to various stock and interest rate time series are presented in order to illustrate the proposed test.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108196"},"PeriodicalIF":1.5,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-clustering multi-view data using the Latent Block Model
Pub Date: 2025-04-10 | DOI: 10.1016/j.csda.2025.108188
Joshua Tobin, Michaela Black, James Ng, Debbie Rankin, Jonathan Wallace, Catherine Hughes, Leane Hoey, Adrian Moore, Jinling Wang, Geraldine Horigan, Paul Carlin, Helene McNulty, Anne M. Molloy, Mimi Zhang
The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, observed on a common set of observations. The multi-view LBM is introduced herein, extending the LBM to multi-view data, where each view marginally follows an LBM. For any pair of views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is used to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in the row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset and is shown to provide new insights for high-dimensional multi-view problems.
{"title":"Co-clustering multi-view data using the Latent Block Model","authors":"Joshua Tobin , Michaela Black , James Ng , Debbie Rankin , Jonathan Wallace , Catherine Hughes , Leane Hoey , Adrian Moore , Jinling Wang , Geraldine Horigan , Paul Carlin , Helene McNulty , Anne M. Molloy , Mimi Zhang","doi":"10.1016/j.csda.2025.108188","DOIUrl":"10.1016/j.csda.2025.108188","url":null,"abstract":"<div><div>The Latent Block Model (LBM) is a prominent model-based co-clustering method, returning parametric representations of each block-cluster and allowing the use of well-grounded model selection methods. Although the LBM has been adapted to accommodate various feature types, it cannot be applied to datasets consisting of multiple distinct sets of features, termed views, for a common set of observations. The multi-view LBM is introduced herein, extending the LBM method to multi-view data, where each view marginally follows an LBM. For any pair of two views, the dependence between them is captured by a row-cluster membership matrix. A likelihood-based approach is formulated for parameter estimation, harnessing a stochastic EM algorithm merged with a Gibbs sampler, while an ICL criterion is formulated to determine the number of row- and column-clusters in each view. To justify the application of the multi-view approach, hypothesis tests are formulated to evaluate the independence of row-clusters across views, with the testing procedure seamlessly integrated into the estimation framework. A penalty scheme is also introduced to induce sparsity in row-clusterings. The algorithm's performance is validated using synthetic and real-world datasets, accompanied by recommendations for optimal parameter selection. Finally, the multi-view co-clustering method is applied to a complex genomics dataset, and is shown to provide new insights for high-dimension multi-view problems.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108188"},"PeriodicalIF":1.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143868688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-parametric tests for cross-dependence based on multivariate extensions of ordinal patterns
Pub Date: 2025-04-10 | DOI: 10.1016/j.csda.2025.108189
Angelika Silbernagel, Christian H. Weiß, Alexander Schnurr
Analyzing the cross-dependence within sequentially observed pairs of random variables is an interesting mathematical problem with several practical applications. Typically, classical dependence measures such as Pearson's correlation are used to this end. This quantity, however, captures only linear dependence and has other drawbacks as well. Different concepts for measuring cross-dependence in sequentially observed random vectors are described, based on so-called ordinal patterns or multivariate generalizations of them. In all cases, limiting distributions of the corresponding test statistics are derived. In a simulation study, the performance of these statistics is compared with three competitors: classical Pearson's and Spearman's correlation as well as the rank-based Chatterjee's correlation coefficient. The applicability of the test statistics is illustrated on two real-world data examples.
{"title":"Non-parametric tests for cross-dependence based on multivariate extensions of ordinal patterns","authors":"Angelika Silbernagel , Christian H. Weiß , Alexander Schnurr","doi":"10.1016/j.csda.2025.108189","DOIUrl":"10.1016/j.csda.2025.108189","url":null,"abstract":"<div><div>Analyzing the cross-dependence within sequentially observed pairs of random variables is an interesting mathematical problem that also has several practical applications. Most of the time, classical dependence measures like Pearson's correlation are used to this end. This quantity, however, only measures linear dependence and has other drawbacks as well. Different concepts for measuring cross-dependence in sequentially observed random vectors, which are based on so-called ordinal patterns or multivariate generalizations of them, are described. In all cases, limiting distributions of the corresponding test statistics are derived. In a simulation study, the performance of these statistics is compared with three competitors, namely, classical Pearson's and Spearman's correlation as well as the rank-based Chatterjee's correlation coefficient. The applicability of the test statistics is illustrated by using them on two real-world data examples.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108189"},"PeriodicalIF":1.5,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143814833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A flexible mixed-membership model for community and enterotype detection for microbiome data
Pub Date: 2025-04-04 | DOI: 10.1016/j.csda.2025.108181
Alice Giampino, Roberto Ascari, Sonia Migliorati
Understanding how the human gut microbiome affects host health is challenging due to the wide interindividual variability, sparsity, and high dimensionality of microbiome data. Mixed-membership models have been previously applied to these data to detect latent communities of bacterial taxa that are expected to co-occur. The most widely used mixed-membership model is latent Dirichlet allocation (LDA). However, LDA is limited by the rigidity of the Dirichlet distribution imposed on the community proportions, which hinders its ability to model dependencies and account for overdispersion. To address this limitation, a generalization of LDA is proposed that introduces greater flexibility into the covariance matrix by incorporating the flexible Dirichlet (FD), a specific identifiable mixture with Dirichlet components. In addition to identifying communities, the new model enables the detection of enterotypes, i.e., clusters of samples with similar microbe composition. For inferential purposes, a computationally efficient collapsed Gibbs sampler that exploits the conjugacy of the FD distribution with respect to the multinomial model is proposed. A simulation study demonstrates the model's ability to accurately recover true parameter values by minimizing appropriate compositional discrepancy measures between the true and estimated values. Additionally, the model correctly identifies the number of communities, as evidenced by perplexity scores. Moreover, an application to the COMBO dataset highlights its effectiveness in detecting biologically significant and coherent communities and enterotypes, revealing a broader range of correlations between community abundances. These results underscore the new model as a definite improvement over LDA.
{"title":"A flexible mixed-membership model for community and enterotype detection for microbiome data","authors":"Alice Giampino, Roberto Ascari, Sonia Migliorati","doi":"10.1016/j.csda.2025.108181","DOIUrl":"10.1016/j.csda.2025.108181","url":null,"abstract":"<div><div>Understanding how the human gut microbiome affects host health is challenging due to the wide interindividual variability, sparsity, and high dimensionality of microbiome data. Mixed-membership models have been previously applied to these data to detect latent communities of bacterial taxa that are expected to co-occur. The most widely used mixed-membership model is latent Dirichlet allocation (LDA). However, LDA is limited by the rigidity of the Dirichlet distribution imposed on the community proportions, which hinders its ability to model dependencies and account for overdispersion. To address this limitation, a generalization of LDA is proposed that introduces greater flexibility into the covariance matrix by incorporating the flexible Dirichlet (FD), a specific identifiable mixture with Dirichlet components. In addition to identifying communities, the new model enables the detection of enterotypes, i.e., clusters of samples with similar microbe composition. For inferential purposes, a computationally efficient collapsed Gibbs sampler that exploits the conjugacy of the FD distribution with respect to the multinomial model is proposed. A simulation study demonstrates the model's ability to accurately recover true parameter values by minimizing appropriate compositional discrepancy measures between the true and estimated values. Additionally, the model correctly identifies the number of communities, as evidenced by perplexity scores. Moreover, an application to the COMBO dataset highlights its effectiveness in detecting biologically significant and coherent communities and enterotypes, revealing a broader range of correlations between community abundances. These results underscore the new model as a definite improvement over LDA.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"210 ","pages":"Article 108181"},"PeriodicalIF":1.5,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiply robust estimation of causal effects using linked data
Pub Date: 2025-04-02 | DOI: 10.1016/j.csda.2025.108175
Shanshan Luo, Yechi Zhang, Wei Li, Zhi Geng
Unmeasured confounding presents a common challenge in observational studies, potentially making standard causal parameters unidentifiable without additional assumptions. Given the increasing availability of diverse data sources, exploiting data linkage offers a potential solution to mitigate unmeasured confounding within a primary study of interest. However, this approach often introduces selection bias, as data linkage is feasible only for a subset of the study population. To address this concern, this paper explores three nonparametric identification strategies assuming that a unit's inclusion in the linked cohort is determined solely by the observed confounders, while acknowledging that the ignorability assumption may depend on some partially unobserved covariates. The existence of multiple identification strategies motivates the development of estimators that effectively capture distinct components of the observed data distribution. Appropriately combining these estimators yields triply robust estimators for the average treatment effect. These estimators remain consistent if at least one of the three distinct parts of the observed data law is correctly specified. Moreover, they are locally efficient if all the models are correctly specified. The proposed estimators are evaluated using simulation studies and real data analysis.
{"title":"Multiply robust estimation of causal effects using linked data","authors":"Shanshan Luo , Yechi Zhang , Wei Li , Zhi Geng","doi":"10.1016/j.csda.2025.108175","DOIUrl":"10.1016/j.csda.2025.108175","url":null,"abstract":"<div><div>Unmeasured confounding presents a common challenge in observational studies, potentially making standard causal parameters unidentifiable without additional assumptions. Given the increasing availability of diverse data sources, exploiting data linkage offers a potential solution to mitigate unmeasured confounding within a primary study of interest. However, this approach often introduces selection bias, as data linkage is feasible only for a subset of the study population. To address such a concern, this paper explores three nonparametric identification strategies assuming that a unit's inclusion in the linked cohort is determined solely by the observed confounders, while acknowledging that the ignorability assumption may depend on some partially unobserved covariates. The existence of multiple identification strategies motivates the development of estimators that effectively capture distinct components of the observed data distribution. Appropriately combining these estimators yields triply robust estimators for the average treatment effect. These estimators remain consistent if at least one of the three distinct parts of the observed data law is correct. Moreover, they are locally efficient if all the models are correctly specified. The proposed estimators are evaluated using simulation studies and real data analysis.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108175"},"PeriodicalIF":1.5,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143769157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eliciting prior information from clinical trials via calibrated Bayes factor
Pub Date: 2025-03-31 | DOI: 10.1016/j.csda.2025.108180
Roberto Macrì Demartino, Leonardo Egidi, Nicola Torelli, Ioannis Ntzoufras
In the Bayesian framework, power prior distributions are increasingly adopted in clinical trials and similar studies to incorporate external and past information, typically to inform the parameter associated with a treatment effect. Their use is particularly effective in scenarios with small sample sizes and where robust prior information is available. A crucial component of this methodology is its weight parameter, which controls the volume of historical information incorporated into the current analysis. Although this parameter can be modeled as either fixed or random, eliciting its prior distribution via a full Bayesian approach remains challenging. In general, this parameter should be carefully selected to accurately reflect the available historical information without dominating the posterior inferential conclusions. A novel simulation-based calibrated Bayes factor procedure is proposed to elicit the prior distribution of the weight parameter, allowing it to be updated according to the strength of the evidence in the data. The goal is to facilitate the integration of historical data when it agrees with the current information and to limit it when discrepancies arise, for instance in the form of prior-data conflicts. The performance of the proposed method is tested through simulation studies and applied to real data from clinical trials.
{"title":"Eliciting prior information from clinical trials via calibrated Bayes factor","authors":"Roberto Macrì Demartino , Leonardo Egidi , Nicola Torelli , Ioannis Ntzoufras","doi":"10.1016/j.csda.2025.108180","DOIUrl":"10.1016/j.csda.2025.108180","url":null,"abstract":"<div><div>In the Bayesian framework power prior distributions are increasingly adopted in clinical trials and similar studies to incorporate external and past information, typically to inform the parameter associated with a treatment effect. Their use is particularly effective in scenarios with small sample sizes and where robust prior information is available. A crucial component of this methodology is represented by its weight parameter, which controls the volume of historical information incorporated into the current analysis. Although this parameter can be modeled as either fixed or random, eliciting its prior distribution via a full Bayesian approach remains challenging. In general, this parameter should be carefully selected to accurately reflect the available historical information without dominating the posterior inferential conclusions. A novel simulation-based calibrated Bayes factor procedure is proposed to elicit the prior distribution of the weight parameter, allowing it to be updated according to the strength of the evidence in the data. The goal is to facilitate the integration of historical data when there is agreement with current information and to limit it when discrepancies arise in terms, for instance, of prior-data conflicts. The performance of the proposed method is tested through simulation studies and applied to real data from clinical trials.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108180"},"PeriodicalIF":1.5,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143746613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discretization: Privacy-preserving data publishing for causal discovery
Pub Date: 2025-03-27 | DOI: 10.1016/j.csda.2025.108174
Youngmin Ahn, Woongjoon Park, Gunwoong Park
As the importance of data privacy continues to grow, data masking has emerged as a crucial protection method. Notably, data masking techniques aim to protect individual privacy while enabling data analysts to derive meaningful statistical results, such as the identification of directional or causal relationships between variables. Hence, this study demonstrates the advantages of quantile-based discretization for protecting privacy and uncovering the relationships between variables in Gaussian directed acyclic graphical (DAG) models. Specifically, it introduces quantile-discretized Gaussian DAG models, where each node variable is discretized based on its quantiles. Additionally, it proposes the bi-partition process, which aids in recovering the covariance matrix; hence, the models are identifiable. Furthermore, a consistent algorithm is developed for learning the underlying structure from the quantile-discretized data. Finally, through numerical experiments and the application of DAG learning algorithms to discretized MLB data, the proposed algorithm is demonstrated to significantly outperform state-of-the-art DAG model learning algorithms.
{"title":"Discretization: Privacy-preserving data publishing for causal discovery","authors":"Youngmin Ahn , Woongjoon Park , Gunwoong Park","doi":"10.1016/j.csda.2025.108174","DOIUrl":"10.1016/j.csda.2025.108174","url":null,"abstract":"<div><div>As the importance of data privacy continues to grow, data masking has emerged as a crucial method. Notably, data masking techniques aim to protect individual privacy, while enabling data analysts to derive meaningful statistical results, such as the identification of directional or causal relationships between variables. Hence, this study demonstrates the advantages of a quantile-based discretization for protecting privacy and uncovering the relationships between variables in Gaussian directed acyclic graphical (DAG) models. Specifically, it introduces quantile-discretized Gaussian DAG models where each node variable is discretized based on the quantiles. Additionally, it proposes the bi-partition process, which aids in recovering the covariance matrix; hence, the models can be identifiable. Furthermore, a consistent algorithm is developed for learning the underlying structure using the quantile-based discretized data. Finally, through numerical experiments and the application of DAG learning algorithms to discretized MLB data, the proposed algorithm is demonstrated to significantly outperform the state-of-the-art DAG model learning algorithms.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108174"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient regularized estimation of graphical proportional hazards model with interval-censored data
Pub Date: 2025-03-27 | DOI: 10.1016/j.csda.2025.108178
Huimin Lu, Yilong Wang, Heming Bing, Shuying Wang, Niya Li
Variable selection arises in many settings in survival analysis. In particular, a large literature has developed on proportional hazards (PH) models for censored survival data. Focusing on interval-censored data, this paper considers complex network structures among the covariates. To address this issue, a more flexible and versatile PH model is developed by combining probabilistic graphical models with PH models to describe the correlation between covariates. Based on the block coordinate descent method, a penalized estimation method is proposed that performs variable selection and parameter estimation simultaneously. The effectiveness of the proposed model and its parameter estimation method is evaluated through simulation studies and an analysis of clinical trial data related to Alzheimer's disease, confirming the reliability and accuracy of the proposed model and method.
{"title":"Efficient regularized estimation of graphical proportional hazards model with interval-censored data","authors":"Huimin Lu , Yilong Wang , Heming Bing , Shuying Wang , Niya Li","doi":"10.1016/j.csda.2025.108178","DOIUrl":"10.1016/j.csda.2025.108178","url":null,"abstract":"<div><div>Variable selection is discussed in many cases in survival analysis. In particular, the analysis of using proportional hazards (PH) models to deal with censored survival data has established a large amount of literature. Based on interval-censored data, this paper discusses the situation of complex network structures existing in covariates. To address the issue, a more flexible and versatile PH model has been developed by combining probabilistic graphical models with PH models, to describe the correlation between covariates. Based on the block coordinate descent method, a penalized estimation method is proposed, which can simultaneously perform variable selection and parameter estimation. The effectiveness of the proposed model and its parameter estimation method are evaluated through simulation studies and the analysis of clinical trial data related to Alzheimer's disease, confirming the reliability and accuracy of the proposed model and method.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108178"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear covariance selection model via ℓ1-penalization
Pub Date: 2025-03-27 | DOI: 10.1016/j.csda.2025.108176
Kwan-Young Bak, Seongoh Park
This paper presents a study on an ℓ1-penalized covariance regression method. Conventional approaches in high-dimensional covariance estimation often lack the flexibility to integrate external information. As a remedy, we adopt the regression-based covariance modeling framework and introduce a linear covariance selection model (LCSM) to encompass a broader spectrum of covariance structures when covariate information is available. Unlike existing methods, we do not assume that the true covariance matrix can be exactly represented by a linear combination of known basis matrices. Instead, we adopt additional basis matrices for a portion of the covariance patterns not captured by the given bases. To estimate high-dimensional regression coefficients, we exploit the sparsity-inducing ℓ1-penalization scheme. Our theoretical analyses are based on the (symmetric) matrix regression model with additive random error matrix, which allows us to establish new non-asymptotic convergence rates of the proposed covariance estimator. The proposed method is implemented with the coordinate descent algorithm. We conduct empirical evaluation on simulated data to complement theoretical findings and underscore the efficacy of our approach. To show the practical applicability of our method, we further apply it to the co-expression analysis of liver gene expression data, where the given basis corresponds to the adjacency matrix of the co-expression network.
{"title":"Linear covariance selection model via ℓ1-penalization","authors":"Kwan-Young Bak , Seongoh Park","doi":"10.1016/j.csda.2025.108176","DOIUrl":"10.1016/j.csda.2025.108176","url":null,"abstract":"<div><div>This paper presents a study on an <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalized covariance regression method. Conventional approaches in high-dimensional covariance estimation often lack the flexibility to integrate external information. As a remedy, we adopt the regression-based covariance modeling framework and introduce a linear covariance selection model (LCSM) to encompass a broader spectrum of covariance structures when covariate information is available. Unlike existing methods, we do not assume that the true covariance matrix can be exactly represented by a linear combination of known basis matrices. Instead, we adopt additional basis matrices for a portion of the covariance patterns not captured by the given bases. To estimate high-dimensional regression coefficients, we exploit the sparsity-inducing <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-penalization scheme. Our theoretical analyses are based on the (symmetric) matrix regression model with additive random error matrix, which allows us to establish new non-asymptotic convergence rates of the proposed covariance estimator. The proposed method is implemented with the coordinate descent algorithm. We conduct empirical evaluation on simulated data to complement theoretical findings and underscore the efficacy of our approach. To show a practical applicability of our method, we further apply it to the co-expression analysis of liver gene expression data where the given basis corresponds to the adjacency matrix of the co-expression network.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108176"},"PeriodicalIF":1.5,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering
Pub Date: 2025-03-26 | DOI: 10.1016/j.csda.2025.108170
Masao Ueki
A deflation-adjusted Bayesian information criterion is proposed by introducing a closed-form adjustment to the variance estimate after K-means clustering. An expected lower bound of the deflation in the variance estimate after K-means clustering is derived and used as an adjustment factor for the variance estimates. The deflation-adjusted variance estimates are applied to the Bayesian information criterion under the Gaussian model for selecting the number of clusters. The closed-form expression makes the proposed deflation-adjusted Bayesian information criterion computationally efficient. Simulation studies show that the deflation-adjusted Bayesian information criterion performs better than other existing clustering methods in some situations, including K-means clustering with the number of clusters selected by standard Bayesian information criteria, the gap statistic, the average silhouette score, the prediction strength, and clustering using a Gaussian mixture model with the Bayesian information criterion. The proposed method is illustrated through a real data application for clustering human genomic data from the 1000 Genomes Project.
{"title":"A deflation-adjusted Bayesian information criterion for selecting the number of clusters in K-means clustering","authors":"Masao Ueki","doi":"10.1016/j.csda.2025.108170","DOIUrl":"10.1016/j.csda.2025.108170","url":null,"abstract":"<div><div>A deflation-adjusted Bayesian information criterion is proposed by introducing a closed-form adjustment to the variance estimate after K-means clustering. An expected lower bound of the deflation in the variance estimate after K-means clustering is derived and used as an adjustment factor for the variance estimates. The deflation-adjusted variance estimates are applied to the Bayesian information criterion under the Gaussian model for selecting the number of clusters. The closed-form expression makes the proposed deflation-adjusted Bayesian information criterion computationally efficient. Simulation studies show that the deflation-adjusted Bayesian information criterion performs better than other existing clustering methods in some situations, including K-means clustering with the number of clusters selected by standard Bayesian information criteria, the gap statistic, the average silhouette score, the prediction strength, and clustering using a Gaussian mixture model with the Bayesian information criterion. The proposed method is illustrated through a real data application for clustering human genomic data from the 1000 Genomes Project.</div></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"209 ","pages":"Article 108170"},"PeriodicalIF":1.5,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}