Empirical Bayes Poisson matrix completion
Pub Date: 2024-05-06 | DOI: 10.1016/j.csda.2024.107976
Xiao Li, Takeru Matsuda, Fumiyasu Komaki
An empirical Bayes method for the Poisson matrix denoising and completion problems is proposed, and a corresponding algorithm called EBPM (Empirical Bayes Poisson Matrix) is developed. The approach is motivated by the non-central singular value shrinkage prior, previously used for estimating the mean matrix parameter of a matrix-variate normal distribution. Numerical experiments show that the EBPM algorithm outperforms the common nuclear norm penalized method in both matrix denoising and completion. The EBPM algorithm is highly efficient and, unlike the nuclear norm penalized method, requires no heuristic tuning of a regularization parameter. It also outperforms competing methods in real-data applications.
{"title":"Empirical Bayes Poisson matrix completion","authors":"Xiao Li , Takeru Matsuda , Fumiyasu Komaki","doi":"10.1016/j.csda.2024.107976","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107976","url":null,"abstract":"<div><p>An empirical Bayes method for the Poisson matrix denoising and completion problems is proposed, and a corresponding algorithm called EBPM (Empirical Bayes Poisson Matrix) is developed. This approach is motivated by the non-central singular value shrinkage prior, which was used for the estimation of the mean matrix parameter of a matrix-variate normal distribution. Numerical experiments show that the EBPM algorithm outperforms the common nuclear norm penalized method in both matrix denoising and completion. The EBPM algorithm is highly efficient and does not require heuristic parameter tuning, as opposed to the nuclear norm penalized method, in which the regularization parameter should be selected. The EBPM algorithm also performs better than others in real-data applications.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"197 ","pages":"Article 107976"},"PeriodicalIF":1.8,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167947324000604/pdfft?md5=1823ebfe249fd22a2c430281b6468d2f&pid=1-s2.0-S0167947324000604-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140880266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Transfer learning via random forests: A one-shot federated approach
Pub Date: 2024-05-06 | DOI: 10.1016/j.csda.2024.107975
Pengcheng Xiang, Ling Zhou, Lu Tang
A one-shot federated transfer learning method using random forests (FTRF) is developed to improve prediction accuracy at a target data site by leveraging information from auxiliary sites. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as a model trained on the target data alone, regardless of possible data heterogeneity, including imbalanced and non-IID data distributions across sites and model mis-specification. FTRF can evaluate the similarity between the target and auxiliary sites, enabling the target site to autonomously select information from more similar sites to enhance its predictive performance. To ensure communication efficiency, FTRF adopts the model averaging idea and requires only a single round of communication between the target and auxiliary sites: only fitted models from auxiliary sites are sent to the target site. Unlike traditional model averaging, FTRF incorporates predicted outcomes from other sites together with the original variables when estimating model averaging weights, yielding variable-dependent weights that better exploit models from auxiliary sites to improve prediction. Five real-world data examples show that FTRF reduces prediction error by 2-40% compared with methods that do not use auxiliary information.
{"title":"Transfer learning via random forests: A one-shot federated approach","authors":"Pengcheng Xiang , Ling Zhou , Lu Tang","doi":"10.1016/j.csda.2024.107975","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107975","url":null,"abstract":"<div><p>A one-shot <u>f</u>ederated <u>t</u>ransfer learning method using <u>r</u>andom <u>f</u>orests (FTRF) is developed to improve the prediction accuracy at a target data site by leveraging information from auxiliary sites. Both theoretical and numerical results show that the proposed federated transfer learning approach is at least as accurate as the model trained on the target data alone regardless of possible data heterogeneity, which includes imbalanced and non-IID data distributions across sites and model mis-specification. FTRF has the ability to evaluate the similarity between the target and auxiliary sites, enabling the target site to autonomously select more similar site information to enhance its predictive performance. To ensure communication efficiency, FTRF adopts the model averaging idea that requires a single round of communication between the target and the auxiliary sites. Only fitted models from auxiliary sites are sent to the target site. Unlike traditional model averaging, FTRF incorporates predicted outcomes from other sites and the original variables when estimating model averaging weights, resulting in a variable-dependent weighting to better utilize models from auxiliary sites to improve prediction. Five real-world data examples show that FTRF reduces the prediction error by 2-40% compared to methods not utilizing auxiliary information.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"197 ","pages":"Article 107975"},"PeriodicalIF":1.8,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140894019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FDR control for linear log-contrast models with high-dimensional compositional covariates
Pub Date: 2024-05-03 | DOI: 10.1016/j.csda.2024.107973
Panxu Yuan, Changhan Jin, Gaorong Li
Linear log-contrast models have been widely used to describe the relationship between the response of interest and compositional covariates, in which one central task is to identify the significant compositional covariates while controlling the false discovery rate (FDR) at a nominal level. To achieve this goal, a new FDR control method is proposed for linear log-contrast models with high-dimensional compositional covariates. An appealing feature of the proposed method is that it completely bypasses traditional p-values and uses only the symmetry property of the test statistic for the unimportant compositional covariates to obtain an upper bound on the FDR. Under some regularity conditions, the proposed method is shown to control the FDR asymptotically at the nominal level, and its theoretical power is proven to approach one as the sample size tends to infinity. The finite-sample performance of the proposed method is evaluated through extensive simulation studies, and applications to microbiome compositional datasets are also provided.
{"title":"FDR control for linear log-contrast models with high-dimensional compositional covariates","authors":"Panxu Yuan, Changhan Jin, Gaorong Li","doi":"10.1016/j.csda.2024.107973","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107973","url":null,"abstract":"<div><p>Linear log-contrast models have been widely used to describe the relationship between the response of interest and the compositional covariates, in which one central task is to identify the significant compositional covariates while controlling the false discovery rate (FDR) at a nominal level. To achieve this goal, a new FDR control method is proposed for linear log-contrast models with high-dimensional compositional covariates. An appealing feature of the proposed method is that it completely bypasses the traditional p-values and utilizes only the symmetry property of the test statistic for the unimportant compositional covariates to give an upper bound of the FDR. Under some regularity conditions, the FDR can be asymptotically controlled at the nominal level for the proposed method in theory, and the theoretical power is also proven to approach one as the sample size tends to infinity. The finite-sample performance of the proposed method is evaluated through extensive simulation studies, and applications to microbiome compositional datasets are also provided.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"197 ","pages":"Article 107973"},"PeriodicalIF":1.8,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140878933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian simultaneous factorization and prediction using multi-omic data
Pub Date: 2024-04-30 | DOI: 10.1016/j.csda.2024.107974
Sarah Samorodnitsky, Chris H. Wendt, Eric F. Lock
Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome, and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors, and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including “blockwise” missingness. Simulations show that BSFP is competitive in recovering latent variation structure and demonstrate the importance of accounting for uncertainty in the estimated factorization within the predictive model. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.
{"title":"Bayesian simultaneous factorization and prediction using multi-omic data","authors":"Sarah Samorodnitsky , Chris H. Wendt , Eric F. Lock","doi":"10.1016/j.csda.2024.107974","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107974","url":null,"abstract":"<div><p>Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including “blockwise” missingness. It is shown via simulation that BSFP is competitive in recovering latent variation structure, and demonstrate the importance of accounting for uncertainty in the estimated factorization within the predictive model. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"197 ","pages":"Article 107974"},"PeriodicalIF":1.8,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140905435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CR-Lasso: Robust cellwise regularized sparse regression
Pub Date: 2024-04-30 | DOI: 10.1016/j.csda.2024.107971
Peng Su, Garth Tarr, Samuel Muller, Suojin Wang
Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may be neither feasible nor efficient in dealing with such contaminated datasets. A robust Lasso-type cellwise regularization procedure, coined CR-Lasso, is proposed; it performs feature selection in the presence of cellwise outliers by simultaneously minimising a regression loss and a cell deviation measure. The approach is evaluated through simulation studies that compare its selection and prediction performance with those of several sparse regression methods. The results demonstrate that CR-Lasso is competitive within the considered settings. The effectiveness of the proposed method is further illustrated through an analysis of a bone mineral density dataset.
{"title":"CR-Lasso: Robust cellwise regularized sparse regression","authors":"Peng Su , Garth Tarr , Samuel Muller , Suojin Wang","doi":"10.1016/j.csda.2024.107971","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107971","url":null,"abstract":"<div><p>Cellwise contamination remains a challenging problem for data scientists, particularly in research fields that require the selection of sparse features. Traditional robust methods may not be feasible nor efficient in dealing with such contaminated datasets. A robust Lasso-type cellwise regularization procedure is proposed which is coined CR-Lasso, that performs feature selection in the presence of cellwise outliers by minimising a regression loss and cell deviation measure simultaneously. The evaluation of this approach involves simulation studies that compare its selection and prediction performance with several sparse regression methods. The results demonstrate that CR-Lasso is competitive within the considered settings. The effectiveness of the proposed method is further illustrated through an analysis of a bone mineral density dataset.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"197 ","pages":"Article 107971"},"PeriodicalIF":1.8,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167947324000550/pdfft?md5=7f097cb47b472d8dfd0dc105cc9fcafa&pid=1-s2.0-S0167947324000550-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140822443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian estimation of large-scale simulation models with Gaussian process regression surrogates
Pub Date: 2024-04-23 | DOI: 10.1016/j.csda.2024.107972
Sylvain Barde
Large-scale, computationally expensive simulation models pose a particular challenge when it comes to estimating their parameters from empirical data. Most simulation models do not possess closed-form expressions for their likelihood function, requiring the use of simulation-based inference, such as the simulated method of moments, indirect inference, likelihood-free inference or approximate Bayesian computation. However, given the high computational requirements of large-scale models, it is often difficult to run these estimation methods, as they require more simulation runs than can feasibly be carried out. This problem is addressed by providing a full Bayesian estimation framework in which the true but intractable likelihood function of the simulation model is replaced by one generated by a surrogate model trained on the limited simulated data. The surrogate is a Linear Model of Coregionalization, where each latent variable is a sparse variational Gaussian process, chosen for its desirable convergence and consistency properties. The effectiveness of the approach is tested using both a simulated Bayesian computing analysis on a known data generating process, and an empirical application in which the free parameters of a computationally demanding agent-based model are estimated on US macroeconomic data.
{"title":"Bayesian estimation of large-scale simulation models with Gaussian process regression surrogates","authors":"Sylvain Barde","doi":"10.1016/j.csda.2024.107972","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107972","url":null,"abstract":"<div><p>Large scale, computationally expensive simulation models pose a particular challenge when it comes to estimating their parameters from empirical data. Most simulation models do not possess closed-form expressions for their likelihood function, requiring the use of simulation-based inference, such as simulated method of moments, indirect inference, likelihood-free inference or approximate Bayesian computation. However, given the high computational requirements of large-scale models, it is often difficult to run these estimation methods, as they require more simulated runs that can feasibly be carried out. The aim is to address the problem by providing a full Bayesian estimation framework where the true but intractable likelihood function of the simulation model is replaced by one generated by a surrogate model trained on the limited simulated data. This is provided by a Linear Model of Coregionalization, where each latent variable is a sparse variational Gaussian process, chosen for its desirable convergence and consistency properties. The effectiveness of the approach is tested using both a simulated Bayesian computing analysis on a known data generating process, and an empirical application in which the free parameters of a computationally demanding agent-based model are estimated on US macroeconomic data.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"196 ","pages":"Article 107972"},"PeriodicalIF":1.8,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167947324000562/pdfft?md5=b53b8e5e84e9796eca1b2069b126ea59&pid=1-s2.0-S0167947324000562-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140644251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous Treatment Effect-based Random Forest: HTERF
Pub Date: 2024-04-16 | DOI: 10.1016/j.csda.2024.107970
Bérénice-Alexia Jocteur, Véronique Maume-Deschamps, Pierre Ribereau
Estimates of causal effects are needed to answer what-if questions about shifts in policy, such as new treatments in pharmacology or new pricing strategies for business owners. A new non-parametric approach based on random forests, HTERF, is proposed to estimate heterogeneous treatment effects. Under the potential outcomes framework with unconfoundedness, HTERF is shown to be pointwise almost surely consistent for the true treatment effect. Interpretability results are also presented.
{"title":"Heterogeneous Treatment Effect-based Random Forest: HTERF","authors":"Bérénice-Alexia Jocteur , Véronique Maume-Deschamps , Pierre Ribereau","doi":"10.1016/j.csda.2024.107970","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107970","url":null,"abstract":"<div><p>Estimates of causal effects are needed to answer what-if questions about shifts in policy, such as new treatments in pharmacology or new pricing strategies for business owners. A new non-parametric approach is proposed to estimate the heterogeneous treatment effect based on random forests (HTERF). The potential outcome framework with unconfoundedness shows that the HTERF is pointwise almost surely consistent with the true treatment effect. Interpretability results are also presented.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"196 ","pages":"Article 107970"},"PeriodicalIF":1.8,"publicationDate":"2024-04-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140605570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variable selection using data splitting and projection for principal fitted component models in high dimension
Pub Date: 2024-04-15 | DOI: 10.1016/j.csda.2024.107960
Seungchul Baek, Hoyoung Park, Junyong Park
Sufficient dimension reduction (SDR) is an effective way to detect nonlinear relationships between a response variable and covariates by reducing the dimensionality of the covariates without information loss. The principal fitted component (PFC) model implements SDR using a class of basis functions; however, the PFC model is not efficient when there are many irrelevant or noisy covariates. A few studies have considered variable selection in the PFC model via penalized regression or sequential likelihood ratio tests. A novel variable selection technique for the PFC model is proposed that incorporates recent developments in multiple testing, such as mirror statistics and random data splitting. It is highlighted how a mirror statistic is constructed in the PFC model by projecting coefficients onto the space generated from the other half of the split data. The proposed method is superior to some existing methods in terms of false discovery rate (FDR) control and applicability to high-dimensional cases. In particular, it outperforms other methods as the number of covariates grows larger, which is appealing in high-dimensional data analysis. Simulation studies and analyses of real data sets demonstrate the finite-sample performance and the gains over existing methods.
{"title":"Variable selection using data splitting and projection for principal fitted component models in high dimension","authors":"Seungchul Baek , Hoyoung Park , Junyong Park","doi":"10.1016/j.csda.2024.107960","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107960","url":null,"abstract":"<div><p>Sufficient dimension reduction (SDR) is such an effective way to detect nonlinear relationship between response variable and covariates by reducing the dimensionality of covariates without information loss. The principal fitted component (PFC) model is a way to implement SDR using some class of basis functions, however the PFC model is not efficient when there are many irrelevant or noisy covariates. There have been a few studies on the selection of variables in the PFC model via penalized regression or sequential likelihood ratio test. A novel variable selection technique in the PFC model has been proposed by incorporating a recent development in multiple testing such as mirror statistics and random data splitting. It is highlighted how we construct a mirror statistic in the PFC model using the idea of projection of coefficients to the other space generated from data splitting. The proposed method is superior to some existing methods in terms of false discovery rate (FDR) control and applicability to high-dimensional cases. In particular, the proposed method outperforms other methods as the number of covariates tends to be getting larger, which would be appealing in high dimensional data analysis. Simulation studies and analyses of real data sets have been conducted to show the finite sample performance and the gain that it yields compared to existing methods.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"196 ","pages":"Article 107960"},"PeriodicalIF":1.8,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140605569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian taut splines for estimating the number of modes
Pub Date: 2024-04-15 | DOI: 10.1016/j.csda.2024.107961
José E. Chacón, Javier Fernández Serrano
The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts.
{"title":"Bayesian taut splines for estimating the number of modes","authors":"José E. Chacón , Javier Fernández Serrano","doi":"10.1016/j.csda.2024.107961","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107961","url":null,"abstract":"<div><p>The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"196 ","pages":"Article 107961"},"PeriodicalIF":1.8,"publicationDate":"2024-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167947324000458/pdfft?md5=9c9dde675ebe359be2107f0ce88120f0&pid=1-s2.0-S0167947324000458-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140605592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian imaging inverse problem with SA-Roundtrip prior via HMC-pCN sampler
Pub Date: 2024-04-10 | DOI: 10.1016/j.csda.2024.107930
Jiayu Qian, Yuanyuan Liu, Jingya Yang, Qingping Zhou
Bayesian inference with deep generative priors has received considerable interest for solving imaging inverse problems in many scientific and engineering fields. The prior distribution is learned from available prior measurements, making its selection an important representation-learning task. The SA-Roundtrip, a novel deep generative prior, is introduced to enable controlled sampling generation and to identify the data's intrinsic dimension. This prior incorporates a self-attention structure within a bidirectional generative adversarial network. Bayesian inference is then applied to the posterior distribution in the low-dimensional latent space using Hamiltonian Monte Carlo with preconditioned Crank-Nicolson (HMC-pCN), an algorithm proven to be ergodic under specific conditions. Experiments on computed tomography (CT) reconstruction with the MNIST and TomoPhantom datasets show that the proposed method outperforms state-of-the-art methods, consistently yielding a robust and superior point estimator along with precise uncertainty quantification.
{"title":"Bayesian imaging inverse problem with SA-Roundtrip prior via HMC-pCN sampler","authors":"Jiayu Qian , Yuanyuan Liu , Jingya Yang , Qingping Zhou","doi":"10.1016/j.csda.2024.107930","DOIUrl":"https://doi.org/10.1016/j.csda.2024.107930","url":null,"abstract":"<div><p>Bayesian inference with deep generative prior has received considerable interest for solving imaging inverse problems in many scientific and engineering fields. The selection of the prior distribution is learned from, and therefore an important representation learning of, available prior measurements. The SA-Roundtrip, a novel deep generative prior, is introduced to enable controlled sampling generation and identify the data's intrinsic dimension. This prior incorporates a self-attention structure within a bidirectional generative adversarial network. Subsequently, Bayesian inference is applied to the posterior distribution in the low-dimensional latent space using the Hamiltonian Monte Carlo with preconditioned Crank-Nicolson (HMC-pCN) algorithm, which is proven to be ergodic under specific conditions. Experiments conducted on computed tomography (CT) reconstruction with the MNIST and TomoPhantom datasets reveal that the proposed method outperforms state-of-the-art comparisons, consistently yielding a robust and superior point estimator along with precise uncertainty quantification.</p></div>","PeriodicalId":55225,"journal":{"name":"Computational Statistics & Data Analysis","volume":"196 ","pages":"Article 107930"},"PeriodicalIF":1.8,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140555566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}