Metamodel-based sensitivity analysis: polynomial chaos expansions and Gaussian processes
Loic Le Gratiet, S. Marelli, B. Sudret
arXiv: Computation. Pub Date : 2016-06-14  DOI: 10.1007/978-3-319-11259-6_38-1
ContaminatedMixt: An R Package for Fitting Parsimonious Mixtures of Multivariate Contaminated Normal Distributions
A. Punzo, A. Mazza, P. McNicholas
arXiv: Computation. Pub Date : 2016-06-12  DOI: 10.18637/JSS.V085.I10
We introduce the R package ContaminatedMixt, conceived to disseminate the use of mixtures of multivariate contaminated normal distributions as a tool for robust clustering and classification under the common assumption of elliptically contoured groups. Thirteen parsimonious variants of the model are also implemented. The expectation-conditional maximization algorithm is adopted to obtain maximum likelihood parameter estimates, and likelihood-based model selection criteria are used to choose the model and the number of groups. Parallel computation can be used on multicore PCs and computer clusters when several models have to be fitted. Unlike the more popular mixtures of multivariate normal and t distributions, this approach also allows for automatic detection of mild outliers via the maximum a posteriori probabilities procedure. To exemplify the use of the package, applications to artificial and real data are presented.
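The outlier-detection mechanism described above rests on a simple density: each group is a two-component mixture of a "good" normal and a variance-inflated one, and a point is a mild outlier when the inflated component wins a posteriori. A univariate Python sketch of that idea (function names are illustrative; this is not the ContaminatedMixt API):

```python
import math

def normal_pdf(x, mu, var):
    # Univariate normal density N(x; mu, var)
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def contaminated_normal_pdf(x, mu, var, alpha, eta):
    # alpha: proportion of "good" points; eta > 1 inflates the variance
    # of the contaminated (outlier-prone) component.
    return alpha * normal_pdf(x, mu, var) + (1 - alpha) * normal_pdf(x, mu, eta * var)

def is_mild_outlier(x, mu, var, alpha, eta):
    # Maximum a posteriori rule: flag x when the contaminated component
    # is the more probable generator of x.
    good = alpha * normal_pdf(x, mu, var)
    bad = (1 - alpha) * normal_pdf(x, mu, eta * var)
    return bad > good
```

In the multivariate case the same rule applies, with Mahalanobis distances taking the place of the squared deviations.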
DNest4: Diffusive Nested Sampling in C++ and Python
B. Brewer, D. Foreman-Mackey
arXiv: Computation. Pub Date : 2016-06-12  DOI: 10.18637/JSS.V086.I07
In probabilistic (Bayesian) inference, we typically want to compute properties of the posterior distribution, which describes knowledge of unknown quantities in the context of a particular dataset and the assumed prior information. The marginal likelihood, also known as the "evidence", is a key quantity in Bayesian model selection. The Diffusive Nested Sampling algorithm, a variant of Nested Sampling, is a powerful tool for generating posterior samples and estimating marginal likelihoods. It is effective on complex problems, including many where the posterior distribution is multimodal or has strong dependencies between variables. DNest4 is an open-source (MIT-licensed), multi-threaded implementation of this algorithm in C++11, along with associated utilities including: (i) RJObject, a class template for finite mixture models; (ii) a Python package allowing basic use without C++ coding; and (iii) experimental support for models implemented in Julia. In this paper we demonstrate DNest4 usage through examples including simple Bayesian data analysis, finite mixture models, and Approximate Bayesian Computation.
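Diffusive Nested Sampling refines the classic Nested Sampling loop, which is easy to sketch: repeatedly discard the worst live point, credit it with the slice of prior volume just shed, and refill above the likelihood threshold. A toy Python version (plain rejection refill instead of DNest4's MCMC moves; all names are illustrative):

```python
import math
import random

def nested_sampling(loglike, prior_sample, n_live=200, n_iter=1200, seed=1):
    # Classic nested sampling: the prior volume X shrinks geometrically,
    # with E[log X_i] = -i / n_live, and the evidence accumulates as
    # Z = sum_i L_i * (X_{i-1} - X_i).
    rng = random.Random(seed)
    live = [prior_sample(rng) for _ in range(n_live)]
    evidence, x_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        worst = min(live, key=loglike)
        x = math.exp(-i / n_live)  # expected remaining prior volume
        evidence += math.exp(loglike(worst)) * (x_prev - x)
        x_prev = x
        threshold = loglike(worst)
        while True:  # rejection refill; DNest4 uses MCMC moves instead
            cand = prior_sample(rng)
            if loglike(cand) > threshold:
                break
        live[live.index(worst)] = cand
    return evidence
```

For the toy problem L(θ) = θ with a uniform prior on (0, 1), the true evidence is ∫θ dθ = 1/2, which the loop recovers to within sampling error.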
Maxima Units Search (MUS) algorithm: methodology and applications
Leonardo Egidi, R. Pappadà, F. Pauli, N. Torelli
arXiv: Computation. Pub Date : 2016-06-08  DOI: 10.1007/978-3-319-73906-9_7
Rectangular Statistical Cartograms in R: The recmap Package
Christian Panse
arXiv: Computation. Pub Date : 2016-06-01  DOI: 10.18637/jss.v086.c01
Cartogram drawing is a technique for showing geography-related statistical information, such as demographic and epidemiological data. The idea is to distort a map by resizing its regions according to a statistical parameter while keeping the map recognizable. This article describes an R package implementing the RecMap algorithm, which approximates every map region by a rectangle whose area corresponds to the given statistical value (maintaining zero cartographic error). The computationally intensive tasks are implemented in C++. The paper demonstrates, on real and synthetic maps, how recmap is implemented and how it can be used together with other statistical packages.
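The "zero cartographic error" invariant is worth making concrete: every rectangle's area is exactly proportional to its region's statistical value. A minimal Python sketch of the sizing step (placement, where RecMap's real difficulty lies in preserving adjacencies, is omitted; names are illustrative):

```python
def rectangle_layout(values, total_area=1.0, aspect=1.0):
    # Zero cartographic error: each rectangle's area is exactly
    # proportional to its region's statistical value.
    total = sum(values.values())
    rects = {}
    for name, v in values.items():
        area = total_area * v / total
        height = (area / aspect) ** 0.5  # area = width * height
        rects[name] = (aspect * height, height)
    return rects
```

Given values {"A": 1, "B": 3} and a total area of 8, region B's rectangle covers exactly three times the area of region A's.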
MCMC with Strings and Branes: The Suburban Algorithm
J. Heckman, J. Bernstein, B. Vigoda
arXiv: Computation. Pub Date : 2016-05-17  DOI: 10.1142/S0217751X17501330
Motivated by the physics of strings and branes, we introduce a general suite of Markov chain Monte Carlo (MCMC) "suburban samplers" (i.e., spread-out Metropolis). The suburban algorithm involves an ensemble of statistical agents connected by a random network. Performance of the collective in reaching fast and accurate inference depends primarily on the average number of nearest-neighbor connections. Increasing the average number of neighbors above zero initially improves performance, but there is a critical connectivity, with effective dimension d_eff ~ 1, above which "groupthink" takes over and the performance of the sampler declines.
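The baseline against which the suburban ensemble is compared — zero nearest-neighbor connections — is just a collection of independent random-walk Metropolis chains. A Python sketch of that zero-connectivity limit (the networked interaction itself is not reproduced here):

```python
import math
import random

def ensemble_metropolis(logpdf, n_walkers=8, n_steps=2000, step=1.0, seed=0):
    # Independent random-walk Metropolis chains: the zero-connectivity
    # limit of the suburban ensemble described above.
    rng = random.Random(seed)
    walkers = [rng.gauss(0.0, 1.0) for _ in range(n_walkers)]
    samples = []
    for _ in range(n_steps):
        for i, x in enumerate(walkers):
            proposal = x + rng.gauss(0.0, step)
            delta = logpdf(proposal) - logpdf(x)
            # Accept with probability min(1, p(proposal) / p(x))
            if delta >= 0 or rng.random() < math.exp(delta):
                walkers[i] = proposal
            samples.append(walkers[i])
    return samples
```

Run against a standard normal log-density, the pooled samples recover its mean and variance to within Monte Carlo error.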
Forward and Inverse Uncertainty Quantification using Multilevel Monte Carlo Algorithms for an Elliptic Nonlocal Equation
A. Jasra, K. Law, Yan Zhou
arXiv: Computation. Pub Date : 2016-03-21  DOI: 10.1615/INT.J.UNCERTAINTYQUANTIFICATION.2016018661
This paper considers uncertainty quantification for an elliptic nonlocal equation. In particular, it is assumed that the parameters defining the kernel in the nonlocal operator are uncertain and a priori distributed according to a probability measure. It is shown that the induced probability measure on quantities of interest arising from functionals of the solution to the equation with random inputs is well defined, as is the posterior distribution on the parameters given observations. As the elliptic nonlocal equation cannot be solved exactly, approximate posteriors are constructed. The multilevel Monte Carlo (MLMC) and multilevel sequential Monte Carlo (MLSMC) sampling algorithms are used for a priori and a posteriori estimation, respectively, of quantities of interest. These algorithms reduce the amount of work needed to estimate posterior expectations, for a given level of error, relative to Monte Carlo and to i.i.d. sampling from the posterior at a given level of approximation of the solution of the elliptic nonlocal equation.
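The MLMC mechanism itself is independent of the nonlocal equation and fits in a few lines: write the fine-level expectation as a telescoping sum E[P_L] = E[P_0] + Σ_l E[P_l − P_{l−1}] and spend most samples on the cheap coarse levels. A toy Python illustration, with a hypothetical quantized payoff standing in for the PDE solver:

```python
import random

def mlmc(sampler_diff, levels, samples_per_level, seed=0):
    # Telescoping estimator: E[P_L] = E[P_0] + sum_l E[P_l - P_{l-1}].
    rng = random.Random(seed)
    estimate = 0.0
    for level, n in zip(levels, samples_per_level):
        estimate += sum(sampler_diff(level, rng) for _ in range(n)) / n
    return estimate

def payoff(u, level):
    # Level-l "solver": quantize the input on a 2**level grid before
    # applying g(u) = u**2 (the exact mean over U(0,1) is 1/3).
    m = 2 ** level
    return (int(u * m) / m) ** 2

def sampler_diff(level, rng):
    u = rng.random()  # the same random input couples adjacent levels
    if level == 0:
        return payoff(u, 0)
    return payoff(u, level) - payoff(u, level - 1)
```

Because the coupled differences shrink with the level, the per-level sample counts can be halved as the level grows, which is where the cost saving comes from.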
kdecopula: An R Package for the Kernel Estimation of Bivariate Copula Densities
T. Nagler
arXiv: Computation. Pub Date : 2016-03-14  DOI: 10.18637/JSS.V084.I07
We describe the R package kdecopula (current version 0.9.0), which provides fast implementations of various kernel estimators for the copula density. Thanks to a variety of available plotting options, it is particularly useful for the exploratory analysis of dependence structures. It can further be used for accurate nonparametric estimation of copula densities and for resampling. The implementation features spline interpolation of the estimates to allow for fast evaluation of density estimates and integrals thereof. We utilize this for a fast renormalization scheme that ensures that estimates are bona fide copula densities and additionally improves the estimators' accuracy. The performance of the methods is illustrated by simulations.
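The two building blocks — rank-transforming the margins to pseudo-observations and smoothing them with a kernel — can be sketched directly. This naive version ignores the unit-square boundary; correcting that bias and renormalizing to a bona fide copula density is exactly what packages like kdecopula add:

```python
import math

def pseudo_observations(data):
    # Rank-transform each margin to (0, 1): u_i = rank_i / (n + 1)
    n = len(data)
    cols = list(zip(*data))
    ranks = []
    for col in cols:
        order = sorted(range(n), key=lambda i: col[i])
        r = [0.0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank / (n + 1)
        ranks.append(r)
    return list(zip(*ranks))

def copula_density(u, v, pseudo, bw=0.1):
    # Naive product-Gaussian kernel estimate on the unit square.
    n = len(pseudo)
    k = lambda z: math.exp(-0.5 * (z / bw) ** 2) / (bw * math.sqrt(2 * math.pi))
    return sum(k(u - ui) * k(v - vi) for ui, vi in pseudo) / n
```

For perfectly comonotone data, the estimate concentrates along the diagonal of the unit square, as a copula density should.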
A Poisson process model for Monte Carlo
Chris J. Maddison
arXiv: Computation. Pub Date : 2016-02-18  DOI: 10.7551/mitpress/10761.003.0008
Simulating samples from arbitrary probability distributions is a major research program of statistical computing. Recent work has shown promise in an old idea: that sampling from a discrete distribution can be accomplished by perturbing and maximizing its mass function. Yet it has not been clearly explained how this research project relates to more traditional ideas in the Monte Carlo literature. This chapter addresses that need by identifying a Poisson process model that unifies the perturbation and accept-reject views of Monte Carlo simulation. Many existing methods can be analyzed in this framework. The chapter reviews Poisson processes and defines a Poisson process model for Monte Carlo methods. This model is used to generalize the perturbation trick to infinite spaces by constructing Gumbel processes, random functions whose maxima are located at samples over infinite spaces. The model is also used to analyze A* sampling and OS*, two methods from distinct Monte Carlo families.
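The finite-space "perturb and maximize" idea the chapter generalizes is the Gumbel-max trick, which takes three lines: add independent Gumbel(0, 1) noise to each log-mass and return the argmax. A Python sketch:

```python
import math
import random

def gumbel_max_sample(logits, rng):
    # Perturb each log-mass with independent Gumbel(0, 1) noise
    # (-log(-log U) for U uniform) and take the argmax; the resulting
    # index is an exact sample from softmax(logits).
    noisy = [lp - math.log(-math.log(rng.random())) for lp in logits]
    return max(range(len(logits)), key=noisy.__getitem__)
```

Over many draws the empirical frequencies converge to the normalized masses, even though no normalizing constant is ever computed.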
Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters
Nathaniel E. Helwig, Ping Ma
arXiv: Computation. Pub Date : 2016-02-16  DOI: 10.4310/SII.2016.V9.N4.A3
In the current era of big data, researchers routinely collect and analyze data with super-large sample sizes, and data-oriented statistical methods have been developed to extract information from them. Smoothing spline ANOVA (SSANOVA) is a promising approach for extracting information from noisy data; however, its heavy computational cost hinders wide application. In this paper, we propose a new algorithm for fitting SSANOVA models to super-large samples. The algorithm introduces rounding parameters to make the computation scalable. To demonstrate the benefits of the rounding parameters, we present a simulation study and a real data example using electroencephalography data. Our results reveal that, using the rounding parameters, a researcher can fit nonparametric regression models to very large samples within a few seconds on a standard laptop or tablet computer.
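The rounding-parameter device is simple to state: round each predictor to a grid of spacing r, so the spline basis is built on at most about 1/r + 1 unique values instead of all n data points, at the cost of perturbing each observation by no more than r/2. A Python sketch of the rounding step (the spline fit itself is omitted):

```python
def round_predictor(x, r):
    # Round each value to the nearest multiple of the rounding
    # parameter r; the fitted model then only sees the (few) unique
    # rounded values, which is what makes the computation scale.
    return [round(v / r) * r for v in x]
```

For 100,000 points on [0, 1] and r = 0.01, the model works with at most 101 distinct values, while no point moves by more than 0.005.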