Improved Chebyshev inequality: new probability bounds with known supremum of PDF
T. Nishiyama
Pub Date: 2018-08-31. DOI: 10.31219/osf.io/h9zfn
In this paper, we derive new probability bounds for Chebyshev's inequality when the supremum of the probability density function is known. The result holds for one-dimensional and multivariate continuous probability distributions with finite mean and variance (covariance matrix). We also show that a similar result holds for specific discrete probability distributions.
The Sliding Window Discrete Fourier Transform
Lee F. Richardson, W. Eddy
Pub Date: 2018-07-20. DOI: 10.1184/R1/8191937.V1
The discrete Fourier transform (DFT) is a widely used tool across science and engineering. Nevertheless, the DFT assumes that the frequency characteristics of a signal remain constant over time, and it is unable to detect local changes. Researchers beginning with Gabor (1946) addressed this shortcoming by inventing methods to obtain time-frequency representations, and this thesis focuses on one such method: the Sliding Window Discrete Fourier Transform (SWDFT). Whereas the DFT operates on an entire signal, the SWDFT takes an ordered sequence of smaller DFTs on contiguous subsets of a signal. The SWDFT is a fundamental tool in time-frequency analysis and is used in a variety of applications, such as spectrogram estimation, image enhancement, and neural networks. This thesis studies the SWDFT from three perspectives: algorithmic, statistical, and applied. Algorithmically, we introduce the Tree SWDFT algorithm and extend it to arbitrary dimensions. Statistically, we derive the marginal distribution and covariance structure of SWDFT coefficients for white noise signals, which allows us to characterize the SWDFT coefficients as a Gaussian process with a known covariance. We also propose a localized version of cosine regression, and show that the approximate maximum likelihood estimate of the frequency parameter in this model is the maximum SWDFT coefficient over all possible window sizes. From an applied perspective, we introduce a new algorithm to decompose signals with multiple non-stationary periodic components, called matching demodulation. We demonstrate the utility of matching demodulation in an analysis of local field potential recordings from a neuroscience experiment.
JMASM 52: Extremely Efficient Permutation and Bootstrap Hypothesis Tests Using R
C. Chatzipantsiou, Marios Dimitriadis, M. Papadakis, M. Tsagris
Pub Date: 2018-06-28. DOI: 10.22237/jmasm/1604189940
Resampling-based statistical tests are known to be computationally heavy but reliable when only small sample sizes are available. Despite their nice theoretical properties, little effort has been devoted to making them efficient. In this paper we treat the cases of the Pearson correlation coefficient and the two-independent-samples t-test, and we propose a highly computationally efficient method for calculating permutation-based p-values in these two settings. The method is general and can be applied or adapted to other, similar two-sample mean or mean-vector cases.
A functional approach to estimation of the parameters of generalized negative binomial and gamma distributions
A. Gorshenin, V. Korolev
Pub Date: 2018-06-27. DOI: 10.1007/978-3-319-99447-5_30
A nonparametric spatial test to identify factors that shape a microbiome
Susheela P. Singh, Ana-Maria Staicu, R. Dunn, N. Fierer, B. Reich
Pub Date: 2018-06-16. DOI: 10.1214/19-aoas1262
The advent of high-throughput sequencing technologies has made data from DNA material readily available, leading to a surge of microbiome-related research establishing links between markers of microbiome health and specific outcomes. However, to harness the power of microbial communities we must understand not only how they affect us, but also how they can be influenced to improve outcomes. This area has been dominated by methods that reduce community composition to summary metrics, which can fail to fully exploit the complexity of community data. Recently, methods have been developed to model the abundance of taxa in a community, but they can be computationally intensive and do not account for spatial effects underlying microbial settlement. These spatial effects are particularly relevant in the microbiome setting because we expect communities that are close together to be more similar than those that are far apart. In this paper, we propose a flexible Bayesian spike-and-slab variable selection model for presence-absence indicators that accounts for spatial dependence and cross-dependence between taxa while reducing dimensionality in both directions. We show by simulation that in the presence of spatial dependence, popular distance-based hypothesis testing methods fail to preserve their advertised size, and the proposed method improves variable selection. Finally, we present an application of our method to indoor fungal communities found in homes across the contiguous United States.
Confidence ellipsoids for regression coefficients by observations from a mixture
V. Miroshnichenko, R. Maiboroda
Pub Date: 2018-06-04. DOI: 10.15559/18-VMSTA105
Confidence ellipsoids for linear regression coefficients are constructed from observations drawn from a mixture with varying concentrations. Two approaches are discussed. The first is a nonparametric approach based on the weighted least squares technique. The second is approximate maximum likelihood estimation, with the EM algorithm applied to compute the estimates.
Bandwidth selection for kernel density estimators of multivariate level sets and highest density regions
Charles R. Doss, Guangwei Weng
Pub Date: 2018-06-03. DOI: 10.1214/18-EJS1501
We consider bandwidth matrix selection for kernel density estimators (KDEs) of density level sets in $\mathbb{R}^d$, $d \ge 2$. We also consider estimation of highest density regions, which differs from estimating level sets in that one specifies the probability content of the set rather than specifying the level directly. This complicates the problem. Bandwidth selection for KDEs is well studied, but the goal of most methods is to minimize a global loss function for the density or its derivatives. The loss we consider here is instead the measure of the symmetric difference of the true set and the estimated set. We derive an asymptotic approximation to the corresponding risk. The approximation depends on unknown quantities which can be estimated, and the approximation can then be minimized to yield a choice of bandwidth, which we show in simulations performs well. We provide an R package, lsbs, for implementing our procedure.
Anchored Bayesian Gaussian mixture models
D. Kunkel, M. Peruggia
Pub Date: 2018-05-21. DOI: 10.1214/20-ejs1756
Finite mixtures are a flexible modeling tool for irregularly shaped densities and samples from heterogeneous populations. When modeling with mixtures using an exchangeable prior on the component features, the component labels are arbitrary and are indistinguishable in posterior analysis. This makes it impossible to attribute any meaningful interpretation to the marginal posterior distributions of the component features. We propose a model in which a small number of observations are assumed to arise from some of the labeled component densities. The resulting model is not exchangeable, allowing inference on the component features without post-processing. Our method assigns meaning to the component labels at the modeling stage and can be justified as a data-dependent informative prior on the labelings. We show that our method produces interpretable results, often (but not always) similar to those resulting from relabeling algorithms, with the added benefit that the marginal inferences originate directly from a well-specified probability model rather than a post hoc manipulation. We provide asymptotic results leading to practical guidelines for model selection that are motivated by maximizing prior information about the class labels, and we demonstrate our method on real and simulated data.
BART with targeted smoothing: An analysis of patient-specific stillbirth risk
Jennifer Starling, Jared S. Murray, C. Carvalho, R. Bukowski, J. Scott
Pub Date: 2018-05-19. DOI: 10.1214/19-aoas1268
This article introduces BART with Targeted Smoothing, or tsBART, a new Bayesian tree-based model for nonparametric regression. The goal of tsBART is to introduce smoothness over a single target covariate t, while not necessarily requiring smoothness over other covariates x. tsBART is based on the Bayesian Additive Regression Trees (BART) model, an ensemble of regression trees, and extends BART by parameterizing each tree's terminal nodes with smooth functions of t rather than independent scalars. Like BART, tsBART captures complex nonlinear relationships and interactions among the predictors. But unlike BART, tsBART guarantees that the response surface will be smooth in the target covariate. This improves interpretability and helps regularize the estimate. After introducing and benchmarking the tsBART model, we apply it to our motivating example: pregnancy outcomes data from the National Center for Health Statistics. Our aim is to provide patient-specific estimates of stillbirth risk across gestational age (t), based on maternal and fetal risk factors (x). Obstetricians expect stillbirth risk to vary smoothly over gestational age, but not necessarily over other covariates, and tsBART has been designed precisely to reflect this structural knowledge. The results of our analysis show the clear superiority of the tsBART model for quantifying stillbirth risk, thereby providing patients and doctors with better information for managing the risk of perinatal mortality. All methods described here are implemented in the R package tsbart.