Sparse Bayesian multidimensional scaling(s)
Computational Statistics 41(1): 12
Pub Date: 2026-01-01. Epub Date: 2025-12-24. DOI: 10.1007/s00180-025-01696-1
Ami Sheth, Aaron Smith, Andrew J Holbrook
Bayesian multidimensional scaling (BMDS) is a probabilistic dimension reduction tool that allows one to model and visualize data consisting of dissimilarities between pairs of objects. Although BMDS has proven useful within, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require burdensome O(N²) floating-point operations, where N is the number of data points. Thus, BMDS becomes impractical as N grows large. We propose and compare two sparse versions of BMDS (sBMDS) that apply log-likelihood and gradient computations to subsets of the observed dissimilarity matrix data. Landmark sBMDS (L-sBMDS) extracts columns, while banded sBMDS (B-sBMDS) extracts diagonals of the data. These sparse variants let one specify a time complexity between O(N²) and O(N). Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency of all models using both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. We observe approximately 3-fold, 10-fold and 40-fold speedups with negligible loss of accuracy when applying the sBMDS likelihoods and gradients to 500, 1,000 and 5,000 data points with 50 bands (landmarks); these speedups only increase with the size of the data considered. Finally, we apply the sBMDS variants to (1) the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks and (2) the clustering of arXiv manuscripts based on low-dimensional representations of article abstracts. In the first application, sBMDS contributes to holistic uncertainty quantification within a larger Bayesian hierarchical model. In the second, sBMDS approximates uncertainty quantification for a downstream modeling task.
{"title":"Sparse Bayesian multidimensional scaling(s).","authors":"Ami Sheth, Aaron Smith, Andrew J Holbrook","doi":"10.1007/s00180-025-01696-1","DOIUrl":"10.1007/s00180-025-01696-1","url":null,"abstract":"<p><p>Bayesian multidimensional scaling (BMDS) is a probabilistic dimension reduction tool that allows one to model and visualize data consisting of dissimilarities between pairs of objects. Although BMDS has proven useful within, e.g., Bayesian phylogenetic inference, its likelihood and gradient calculations require burdensome [Formula: see text] floating-point operations, where <i>N</i> is the number of data points. Thus, BMDS becomes impractical as <i>N</i> grows large. We propose and compare two sparse versions of BMDS (sBMDS) that apply log-likelihood and gradient computations to subsets of the observed dissimilarity matrix data. Landmark sBMDS (L-sBMDS) extracts columns, while banded sBMDS (B-sBMDS) extracts diagonals of the data. These sparse variants let one specify a time complexity between [Formula: see text] and <i>N</i>. Under simplified settings, we prove posterior consistency for subsampled distance matrices. Through simulations, we examine the accuracy and computational efficiency across all models using both the Metropolis-Hastings and Hamiltonian Monte Carlo algorithms. We observe approximately 3-fold, 10-fold and 40-fold speedups with negligible loss of accuracy, when applying the sBMDS likelihoods and gradients to 500, 1000 and 5,000 data points with 50 bands (landmarks); these speedups only increase with the size of data considered. Finally, we apply the sBMDS variants to: (1) the phylogeographic modeling of multiple influenza subtypes to better understand how these strains spread through global air transportation networks and (2) the clustering of ArXiv manuscripts based on low-dimensional representations of article abstracts. In the first application, sBMDS contributes to holistic uncertainty quantification within a larger Bayesian hierarchical model. In the second, sBMDS approximates uncertainty quantification for a downstream modeling task.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"41 1","pages":"12"},"PeriodicalIF":1.4,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12738595/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145851501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Approximate Bayesian inference in a model for self-generated gradient collective cell movement
Computational Statistics 40(7): 3399-3452
Pub Date: 2025-01-01. Epub Date: 2025-03-08. DOI: 10.1007/s00180-025-01606-5
Jon Devlin, Agnieszka Borowska, Dirk Husmeier, John Mackenzie
In this article, we explore parameter inference in a novel hybrid discrete-continuum model describing the movement of a population of cells in response to a self-generated chemotactic gradient. The model employs a drift-diffusion stochastic process, rendering likelihood-based inference methods impractical. Consequently, we consider approximate Bayesian computation (ABC) methods, which have gained popularity for models with intractable or computationally expensive likelihoods. ABC approximates the posterior distribution by simulating from the generative model and retaining the parameters whose generated observations are "close enough" to the true data. Given the plethora of existing ABC methods, selecting the most suitable one for a specific problem can be challenging. To address this, we employ a simple drift-diffusion stochastic differential equation (SDE) as a benchmark problem. This allows us to assess the accuracy of popular ABC algorithms under known configurations. We also evaluate the bias between ABC posteriors and the exact posterior for the basic SDE model, where the posterior distribution is tractable. The top-performing ABC algorithms are subsequently applied to the proposed cell movement model to infer its key parameters. This study not only contributes to understanding cell movement but also sheds light on the comparative efficiency of different ABC algorithms in a well-defined context.
{"title":"Approximate Bayesian inference in a model for self-generated gradient collective cell movement.","authors":"Jon Devlin, Agnieszka Borowska, Dirk Husmeier, John Mackenzie","doi":"10.1007/s00180-025-01606-5","DOIUrl":"10.1007/s00180-025-01606-5","url":null,"abstract":"<p><p>In this article we explore parameter inference in a novel hybrid discrete-continuum model describing the movement of a population of cells in response to a self-generated chemotactic gradient. The model employs a drift-diffusion stochastic process, rendering likelihood-based inference methods impractical. Consequently, we consider approximate Bayesian computation (ABC) methods, which have gained popularity for models with intractable or computationally expensive likelihoods. ABC involves simulating from the generative model, using parameters from generated observations that are \"close enough\" to the true data to approximate the posterior distribution. Given the plethora of existing ABC methods, selecting the most suitable one for a specific problem can be challenging. To address this, we employ a simple drift-diffusion stochastic differential equation (SDE) as a benchmark problem. This allows us to assess the accuracy of popular ABC algorithms under known configurations. We also evaluate the bias between ABC-posteriors and the exact posterior for the basic SDE model, where the posterior distribution is tractable. The top-performing ABC algorithms are subsequently applied to the proposed cell movement model to infer its key parameters. This study not only contributes to understanding cell movement but also sheds light on the comparative efficiency of different ABC algorithms in a well-defined context.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"40 7","pages":"3399-3452"},"PeriodicalIF":1.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12255578/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144638687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A powerful penalized multinomial logistic regression approach
Computational Statistics 40(8): 4565-4587
Pub Date: 2025-01-01. Epub Date: 2025-05-25. DOI: 10.1007/s00180-025-01635-0
Cornelia Fuetterer, Malte Nalenz, Thomas Augustin, Ruth M Pfeiffer
Penalized regression methods that shrink model coefficients are popular approaches to improve prediction and for variable selection in high-dimensional settings. We present a penalized (or regularized) regression approach for multinomial logistic models for categorical outcomes with a novel adaptive L1-type penalty term that incorporates weights based on intra- and inter-outcome category distances of each predictor. A predictor that has large between- and small within-outcome category distances is penalized less and is more likely to be selected for the final model. We propose and study three measures for weight calculation: an analysis of variance (ANOVA)-based measure and two indices used in clustering approaches. Our novel approach, which we term the discriminative power lasso (DP-lasso), thus combines elements of marginal screening with regularized regression methods. We studied the performance of DP-lasso and other published methods in simulations with varying numbers of outcome categories, numbers of predictors, strengths of associations and predictor correlation structures. For correlated predictors, the DP-lasso approach with ANOVA-based weights (DPan) resulted in much sparser models than other regularization approaches, especially in high-dimensional settings. When the number p of (correlated) predictors was much larger than the available sample size N, DPan had the highest true positive rate while maintaining low false positive rates for all simulation settings. Similarly, when p < N, DPan had high true positive rates and the lowest false positive rates of all methods studied. Thus we recommend DPan for analysing categorical outcomes in relation to high-dimensional predictors. We further illustrate all approaches in ultra high-dimensional settings, using several single-cell RNA-sequencing datasets.
Supplementary information: The online version contains supplementary material available at 10.1007/s00180-025-01635-0.
{"title":"A powerful penalized multinomial logistic regression approach.","authors":"Cornelia Fuetterer, Malte Nalenz, Thomas Augustin, Ruth M Pfeiffer","doi":"10.1007/s00180-025-01635-0","DOIUrl":"10.1007/s00180-025-01635-0","url":null,"abstract":"<p><p>Penalized regression methods that shrink model coefficients are popular approaches to improve prediction and for variable selection in high-dimensional settings. We present a penalized (or regularized) regression approach for multinomial logistic models for categorical outcomes with a novel adaptive L1-type penalty term, that incorporates weights based on intra- and inter-outcome category distances of each predictor. A predictor that has large between- and small within-outcome category distances is penalized less and has a higher likelihood to be selected for the final model. We propose and study three measures for weight calculation: an analysis of variance (ANOVA)-based measure and two indices used in clustering approaches. Our novel approach, that we term the <i>discriminative power lasso</i> (DP-lasso), thus combines elements of marginal screening with regularized regression methods. We studied the performance of DP-lasso and other published methods in simulations with varying numbers of outcome categories, numbers of predictors, strengths of associations and predictor correlation structures. For correlated predictors, the DP-lasso approach with ANOVA based weights (DPan) resulted in much sparser models than other regularization approaches, especially in high-dimensional settings. When the number <i>p</i> of (correlated) predictors was much larger than the available sample size <i>N</i>, DPan had the highest true positive rate while maintaining low false positive rates for all simulation settings. Similarly, when <math><mrow><mi>p</mi> <mo><</mo> <mi>N</mi></mrow> </math> , DPan had high true positive rates and the lowest false positive rates of all methods studied. Thus we recommend DPan for analysing categorical outcomes in relation to high-dimensional predictors. We further illustrate all approaches in ultra high-dimensional settings, using several single-cell RNA-sequencing datasets.</p><p><strong>Supplementary information: </strong>The online version contains supplementary material available at 10.1007/s00180-025-01635-0.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"40 8","pages":"4565-4587"},"PeriodicalIF":1.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12552268/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145379907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Misspecification-robust likelihood-free inference in high dimensions
Computational Statistics 40(8): 4399-4439
Pub Date: 2025-01-01. Epub Date: 2025-05-03. DOI: 10.1007/s00180-025-01607-4
Owen Thomas, Raquel Sá-Leão, Hermínia de Lencastre, Samuel Kaski, Jukka Corander, Henri Pesonen
Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy into a useful tool for practitioners. However, models with more than a handful of parameters generally remain a challenge for Approximate Bayesian Computation (ABC)-based inference. To advance the possibilities for performing likelihood-free inference in higher-dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation-based approach that approximates discrepancy functions in a probabilistic manner, which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher-dimensional parameter spaces by using separate acquisition functions, discrepancies, and associated summary statistics for distinct subsets of the parameters. The efficient additive acquisition structure is combined with an exponentiated loss-likelihood to provide a misspecification-robust characterisation of posterior distributions for subsets of model parameters. The method successfully performs computationally efficient inference in a moderately sized parameter space and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space.
{"title":"Misspecification-robust likelihood-free inference in high dimensions.","authors":"Owen Thomas, Raquel Sá-Leão, Hermínia de Lencastre, Samuel Kaski, Jukka Corander, Henri Pesonen","doi":"10.1007/s00180-025-01607-4","DOIUrl":"10.1007/s00180-025-01607-4","url":null,"abstract":"<p><p>Likelihood-free inference for simulator-based statistical models has developed rapidly from its infancy to a useful tool for practitioners. However, models with more than a handful of parameters still generally remain a challenge for the Approximate Bayesian Computation (ABC) based inference. To advance the possibilities for performing likelihood-free inference in higher dimensional parameter spaces, we introduce an extension of the popular Bayesian optimisation based approach to approximate discrepancy functions in a probabilistic manner which lends itself to an efficient exploration of the parameter space. Our approach achieves computational scalability for higher dimensional parameter spaces by using separate acquisition functions, discrepancies, and associated summary statistics for distinct subsets of the parameters. The efficient additive acquisition structure is combined with exponentiated loss-likelihood to provide a misspecification-robust characterisation of posterior distributions for subsets of model parameters. The method successfully performs computationally efficient inference in a moderately sized parameter space and compares favourably to existing modularised ABC methods. We further illustrate the potential of this approach by fitting a bacterial transmission dynamics model to a real data set, which provides biologically coherent results on strain competition in a 30-dimensional parameter space.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"40 8","pages":"4399-4439"},"PeriodicalIF":1.4,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12552272/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145379885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayes estimation of ratio of scale-like parameters for inverse Gaussian distributions and applications to classification
Pub Date: 2024-09-19. DOI: 10.1007/s00180-024-01554-6
Ankur Chakraborty, Nabakumar Jana
We consider two inverse Gaussian populations with a common mean but different scale-like parameters, where all parameters are unknown. We construct noninformative priors for the ratio of the scale-like parameters to derive matching priors of different orders. Reference priors are proposed for different groups of parameters. The Bayes estimators of the common mean and of the ratio of the scale-like parameters are also derived. We propose confidence intervals for the conditional error rate when classifying an observation into one of the inverse Gaussian distributions. A generalized variable-based confidence interval and the highest posterior density credible intervals for the error rate are computed. We estimate the parameters of the mixture of these inverse Gaussian distributions and obtain estimates of the expected probability of correct classification. An intensive simulation study has been carried out to compare the estimators and the expected probability of correct classification. Real data-based examples are given to show the practicality and effectiveness of the estimators.
{"title":"Bayes estimation of ratio of scale-like parameters for inverse Gaussian distributions and applications to classification","authors":"Ankur Chakraborty, Nabakumar Jana","doi":"10.1007/s00180-024-01554-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01554-6","url":null,"abstract":"<p>We consider two inverse Gaussian populations with a common mean but different scale-like parameters, where all parameters are unknown. We construct noninformative priors for the ratio of the scale-like parameters to derive matching priors of different orders. Reference priors are proposed for different groups of parameters. The Bayes estimators of the common mean and ratio of the scale-like parameters are also derived. We propose confidence intervals of the conditional error rate in classifying an observation into inverse Gaussian distributions. A generalized variable-based confidence interval and the highest posterior density credible intervals for the error rate are computed. We estimate parameters of the mixture of these inverse Gaussian distributions and obtain estimates of the expected probability of correct classification. An intensive simulation study has been carried out to compare the estimators and expected probability of correct classification. Real data-based examples are given to show the practicality and effectiveness of the estimators.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"50 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multivariate approaches to investigate the home and away behavior of football teams playing football matches
Pub Date: 2024-09-17. DOI: 10.1007/s00180-024-01553-7
Antonello D’Ambra, Pietro Amenta, Antonio Lucadamo
Compared to other European competitions, participation in the UEFA Champions League is a real "bargain" for football clubs due to the hefty bonuses awarded based on performance during the group qualification phase. Successful performance in football depends on several multidimensional factors, and analyzing the main ones remains challenging. In performance studies, little attention has been paid to teams' behavior when playing at home and away. Our study combines statistical techniques to develop a procedure for examining teams' performance. Several considerations make the 2022–2023 Serie A league season particularly interesting to analyze with our approach. Except for Napoli, all the teams showed different home and away behaviors with respect to the results obtained at the season's end. Ball possession and corners positively influenced points scored in both home and away games, though with different impacts. The precision indicator was not an essential variable. The procedure highlighted the negative roles played by offsides, as well as yellow and red cards.
{"title":"Multivariate approaches to investigate the home and away behavior of football teams playing football matches","authors":"Antonello D’Ambra, Pietro Amenta, Antonio Lucadamo","doi":"10.1007/s00180-024-01553-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01553-7","url":null,"abstract":"<p>Compared to other European competitions, participation in the Uefa Champions League is a real “bargain” for football clubs due to the hefty bonuses awarded based on performance during the group qualification phase. To perform successfully in football depends on several multidimensional factors, and analyzing the main ones remains challenging. In the performance study, little attention has been paid to teams’ behavior when playing at home and away. Our study combines statistical techniques to develop a procedure to examine teams’ performance. Several considerations make the 2022–2023 Serie A league season particularly interesting to analyze with our approach. Except for Napoli, all the teams showed different home-and-away behaviors concerning the results obtained at the season’s end. Ball possession and corners have positively influenced scored points in both home and away games with a different impact. The precision indicator was not an essential variable. The procedure highlighted the negative roles played by offside, as well as yellow and red cards.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"2 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kendall correlations and radar charts to include goals for and goals against in soccer rankings
Pub Date: 2024-09-17. DOI: 10.1007/s00180-024-01542-w
Roy Cerqueti, Raffaele Mattera, Valerio Ficcadenti
This paper deals with the challenging theme of how sporting teams and athletes are ranked in sports competitions. Starting from the paradigmatic case of soccer, we advance a new method for ranking teams in the official national championships through computational statistics methods based on Kendall correlations and radar charts. In detail, we consider the goals scored for and against the teams in the individual matches as a further source of score assignment beyond the usual win-tie-lose trichotomy. Our approach overcomes some biases in the scoring rules that are currently employed. The methodological proposal is tested on the relevant case of the Italian "Serie A" championships played during 1930–2023.
{"title":"Kendall correlations and radar charts to include goals for and goals against in soccer rankings","authors":"Roy Cerqueti, Raffaele Mattera, Valerio Ficcadenti","doi":"10.1007/s00180-024-01542-w","DOIUrl":"https://doi.org/10.1007/s00180-024-01542-w","url":null,"abstract":"<p>This paper deals with the challenging themes of the way sporting teams and athletes are ranked in sports competitions. Starting from the paradigmatic case of soccer, we advance a new method for ranking teams in the official national championships through computational statistics methods based on Kendall correlations and radar charts. In detail, we consider the goals for and against the teams in the individual matches as a further source of score assignment beyond the usual win-tie-lose trichotomy. Our approach overcomes some biases in the scoring rules that are currently employed. The methodological proposal is tested over the relevant case of the Italian “Serie A” championships played during 1930–2023.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"35 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian adaptive lasso quantile regression with non-ignorable missing responses
Pub Date: 2024-09-16. DOI: 10.1007/s00180-024-01546-6
Ranran Chen, Mai Dao, Keying Ye, Min Wang
In this paper, we develop a fully Bayesian adaptive lasso quantile regression model to analyze data with non-ignorable missing responses, which frequently occur in various fields of study. Specifically, we employ a logistic regression model to handle the non-ignorable missingness mechanism. By using the asymmetric Laplace working likelihood for the data and specifying Laplace priors for the regression coefficients, our proposed method extends the Bayesian lasso framework by imposing specific penalization parameters on each regression coefficient, enhancing estimation and variable selection capability. Furthermore, we embrace the normal-exponential mixture representation of the asymmetric Laplace distribution and the Student-t approximation of the logistic regression model to develop a simple and efficient Gibbs sampling algorithm for generating posterior samples and making statistical inferences. The finite-sample performance of the proposed algorithm is investigated through various simulation studies and a real-data example.
{"title":"Bayesian adaptive lasso quantile regression with non-ignorable missing responses","authors":"Ranran Chen, Mai Dao, Keying Ye, Min Wang","doi":"10.1007/s00180-024-01546-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01546-6","url":null,"abstract":"<p>In this paper, we develop a fully Bayesian adaptive lasso quantile regression model to analyze data with non-ignorable missing responses, which frequently occur in various fields of study. Specifically, we employ a logistic regression model to deal with missing data of non-ignorable mechanism. By using the asymmetric Laplace working likelihood for the data and specifying Laplace priors for the regression coefficients, our proposed method extends the Bayesian lasso framework by imposing specific penalization parameters on each regression coefficient, enhancing our estimation and variable selection capability. Furthermore, we embrace the normal-exponential mixture representation of the asymmetric Laplace distribution and the Student-<i>t</i> approximation of the logistic regression model to develop a simple and efficient Gibbs sampling algorithm for generating posterior samples and making statistical inferences. The finite-sample performance of the proposed algorithm is investigated through various simulation studies and a real-data example.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"94 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Statistical visualisation of tidy and geospatial data in R via kernel smoothing methods in the eks package
Pub Date: 2024-09-14. DOI: 10.1007/s00180-024-01543-9
Tarn Duong
Kernel smoothers are essential tools for data analysis due to their ability to convey complex statistical information with concise graphical visualisations. Their inclusion in the base distribution and in the many user-contributed add-on packages of the R statistical analysis environment caters well to many practitioners, though some important gaps remain for specialised data, most notably tidy and geospatial data. The proposed eks package fills in these gaps. In addition to kernel density estimation, this package also caters for more complex data analysis situations, such as density derivative estimation, density-based classification (supervised learning) and mean shift clustering (unsupervised learning). We illustrate with experimental data how to obtain and interpret the statistical visualisations for these kernel smoothing methods.
{"title":"Statistical visualisation of tidy and geospatial data in R via kernel smoothing methods in the eks package","authors":"Tarn Duong","doi":"10.1007/s00180-024-01543-9","DOIUrl":"https://doi.org/10.1007/s00180-024-01543-9","url":null,"abstract":"<p>Kernel smoothers are essential tools for data analysis due to their ability to convey complex statistical information with concise graphical visualisations. Their inclusion in the base distribution and in the many user-contributed add-on packages of the <span>R</span> statistical analysis environment caters well to many practitioners. Though there remain some important gaps for specialised data, most notably for tidy and geospatial data. The proposed <span>eks</span> package fills in these gaps. In addition to kernel density estimation, this package also caters for more complex data analysis situations, such as density derivative estimation, density-based classification (supervised learning) and mean shift clustering (unsupervised learning). We illustrate with experimental data how to obtain and to interpret the statistical visualisations for these kernel smoothing methods.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"119 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Using the Krylov subspace formulation to improve regularisation and interpretation in partial least squares regression
Pub Date: 2024-09-12. DOI: 10.1007/s00180-024-01545-7
Tommy Löfstedt
Partial least squares regression (PLS-R) has been an important regression method in the life sciences and many other fields for decades. However, PLS-R is typically solved using an opaque algorithmic approach rather than through an optimisation formulation and procedure. There is a clear optimisation formulation of the PLS-R problem based on a Krylov subspace, but it is only rarely considered. The popularity of PLS-R is attributed to the ability to interpret the data through the model components, but these components are not available when solving the PLS-R problem using the Krylov subspace formulation. We therefore highlight a simple reformulation of the PLS-R problem in terms of the Krylov subspace as a promising modelling framework for PLS-R, and illustrate one of its main benefits: it allows arbitrary penalties on the regression coefficients in the PLS-R model. Further, we propose an approach to estimate the PLS-R model components for the solution found through the Krylov subspace formulation; these are the components we would have obtained had we used the common algorithms for estimating the PLS-R model. We illustrate the utility of the proposed method on simulated and real data.
{"title":"Using the Krylov subspace formulation to improve regularisation and interpretation in partial least squares regression","authors":"Tommy Löfstedt","doi":"10.1007/s00180-024-01545-7","DOIUrl":"https://doi.org/10.1007/s00180-024-01545-7","url":null,"abstract":"<p>Partial least squares regression (PLS-R) has been an important regression method in the life sciences and many other fields for decades. However, PLS-R is typically solved using an opaque algorithmic approach, rather than through an optimisation formulation and procedure. There is a clear optimisation formulation of the PLS-R problem based on a Krylov subspace formulation, but it is only rarely considered. The popularity of PLS-R is attributed to the ability to interpret the data through the model components, but the model components are not available when solving the PLS-R problem using the Krylov subspace formulation. We therefore highlight a simple reformulation of the PLS-R problem using the Krylov subspace formulation as a promising modelling framework for PLS-R, and illustrate one of the main benefits of this reformulation—that it allows arbitrary penalties of the regression coefficients in the PLS-R model. Further, we propose an approach to estimate the PLS-R model components for the solution found through the Krylov subspace formulation, that are those we would have obtained had we been able to use the common algorithms for estimating the PLS-R model. We illustrate the utility of the proposed method on simulated and real data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"25 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142186113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}